-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix "box model" for code blocks #1317
Conversation
I think this PR should serve as a basis for @NoahTheDuke to add the layout information necessary to reconstruct the actual layout from the content. Indeed, without a proper "box model", it will be hard to have a nice layout type. I think this layout could have the following type: type code_block_layout = {
top : string;
left : int list;
bottom : string
} And that it should be possible to reconstruct the content with the following function: let reconstruct layout content =
let lines = String.split_on_char '\n' content in
let lines = List.map2 (fun i l -> (String.make ' ' i ^ l) layout.left lines in
layout.top ^
String.concat ~sep:'\n' lines
^ layout.bottom |
I think you rather want to mostly follow what commonmark does here slightly adapted for the fact that the content can be on the line delimiting the block
|
That would be a possibility too, though more breaking (although I'm not sure how many code blocks would change in practice). That being said, I'm not sure how close this is to what commonmark does. From what I understand, commonmark only looks at the indentation of the opening fence (as mentioned in this example). In what you suggest, it's not the fence opening which determines the left margin, but first non-whitespace character, which feels quite different to me. We could also use the fence to determine left-margin. But, in
the content would become indented. I would be in favor of keeping the current box model. It's fewer changes, and only few people would relate the "common-mark like" proposal to their experience of commonmark. But I'm actually fine with both. |
Also, the box model allows to start with spaces (but not newlines), and so do commonmark. Using the first non-whitespace as left-margin does not allow that! (Although, I cannot give an example of a plausible scenario where we need to do that, I'm sure there is one!) |
Yes it is different: there is never any content on the opening fence in commonmark so you can use the marker for defining the indentation level.
You could escapes for that but I'm no longer sure how escapes work in That being said if your goal is this then you would be better off keeping a precise location of the start of content and simply keep the raw string of the content in the node. |
(And not even, if you have precise start/end locations and the source you can simply extract the part) |
Note that the problem with the current box model, is that it's not very clear. To this day I always write code fences fully flushed left to make sure I get what I want… This allows for a clear mental model in my opinion: namely something clear has to define the indentation level in the beginning of the code fence and then everything is simply left aligned on that indentation level with any whitespace before reaching the indentation level being dropped. |
Ocaml's comments are naturally indented, which makes using Commonmark's style harder. It requires deleting indentation that an editor or formatter might try to add. |
Mdx does it this way - by getting the locations (which are indeed precise) and splicing the text from the original file. |
To be clear, the only benefit to handling this is for display, right? Ocaml doesn't care about indentation, so Mdx and/or an odoc |
I think the box model is the correct way to think about code blocks (and verbatim blocks, while we're at it).
1) stdlib/float.mli puts the first line against the braces and then indents the following lines to the column where the braces sit (note the 2 space indent to everything, including the opening of the comment block): (** ...
For example, consider the following program:
{[let size = 100_000_000
let a = Float.Array.make size 1.
let update a f () =
Float.Array.iteri (fun i x -> Float.Array.set a i (f x)) a
let d1 = Domain.spawn (update a (fun x -> x +. 1.))
let d2 = Domain.spawn (update a (fun x -> 2. *. x +. 1.))
let () = Domain.join d1; Domain.join d2
]}
... *) 2) stdlib/condition.mli has a line break and then indents to the column one past the square brace: (** ...
{[
Mutex.lock m;
while not P do
Mutex.unlock m; Mutex.lock m
done;
<update the data structure>;
Mutex.unlock m
]}
... *) 3) stdlib/bytes.mli has a line break and no indentation: (** ...
{[
let string_init len f : string =
let s = Bytes.create len in
for i = 0 to len - 1 do Bytes.set s i (f i) done;
Bytes.unsafe_to_string s
]}
... *) 4) Another part of stdlib/bytes.mli dedents the braces, has no spaces or line break, and no indentation of the code block: (** ...
For example, consider the following program:
{[let size = 100_000_000
let b = Bytes.make size ' '
let update b f () =
Bytes.iteri (fun i x -> Bytes.set b i (Char.chr (f (Char.code x)))) b
let d1 = Domain.spawn (update b (fun x -> x + 1))
let d2 = Domain.spawn (update b (fun x -> 2 * x + 1))
let () = Domain.join d1; Domain.join d2
]}
... *) 5) stdlib/arg.mli dedents the braces, and indents the code block two spaces past where the text above sits (an awkward 5 spaces): (** ...
This module provides a general mechanism for extracting options and
arguments from the command line to the program. For example:
{[
let usage_msg = "append [-verbose] <file1> [<file2>] ... -o <output>"
let verbose = ref false
let input_files = ref []
let output_file = ref ""
... *) 6) stdlib/nativeint.mli has a line break and then indents the code block to align with the square brace instead of 1 past: (** ...
{[
let zero: nativeint = 0n
let one: nativeint = 1n
let m_one: nativeint = -1n
]}
... *) 7) stdlib/queue.mli has a line break and then indents to align with the curly brace: (** ...
{[
# let q = Queue.create ()
val q : '_weak1 Queue.t = <abstr>
... *) 8) stdlib/dynarray.ml (not
9) Most cursed of all, stdlib/either.mli dedents the braces, has no linebreak and then indents past the code of the initial line as if it was not on the same line as the braces: (** ...
For example:
{[List.partition_map:
('a -> ('b, 'c) Either.t) -> 'a list -> 'b list * 'c list]}
... *) (This is the only instance of this style in All of the code blocks except for 9) are consistent with the below algorithm.
|
Just thinking about the implementation, I was wondering about maybe having the raw unaltered text in the AST, then a separate function that applies the formatting rules being suggested in this thread. If we did it that way we'd avoid the need for the original source as currently required by mdx. |
I took a shot at implementing my algorithm above which led to #1318. Can be helpful to compare the two sets of outputs. |
That looks ok to me.
Nothing different here, CommonMark also does that. The only difference is that CommonMark's delimiters stand on their own line (and is thus easier to use as the indentation mark).
No, you need precise locations and layout for compilation errors of the extracted source code to be reported at the right place in the original. You need a line directive followed by the exact, verbatim, code for compilation errors to show up exactly at the right place in the source.
Unless I'm mistaken that is the case for my proposal too. Personally I don't see any benefit of using the line with the smallest indent as you suggest (rather than the first line):
Besides the same problem as @panglesd mentioned remains: how do you solve the problem of enforcing an initial indent ? For the latter I have a few ideas but I don't really like them. One would be to simply use the non newline whitespace directly following |
The reason you use the lowest indent is that in example 9), if it had multiple lines, then you'd account for additional indentation on the second line: (** ...
For example:
{[List.partition_map:
('a -> ('b, 'c) Either.t) -> 'a list -> 'b list * 'c list
List.partition_map:
('a -> ('b, 'c) Either.t) -> 'a list -> 'b list * 'c list]}
... *) renders as [%expect {|
((output
(((f.ml (2 4) (2 16))
(paragraph
(((f.ml (2 4) (2 7)) (word For)) ((f.ml (2 7) (2 8)) space)
((f.ml (2 8) (2 16)) (word example:)))))
((f.ml (4 0) (8 63))
(code_block
((f.ml (4 2) (8 61))
"List.partition_map:\
\n ('a -> ('b, 'c) Either.t) -> 'a list -> 'b list * 'c list\
\n\
\nList.partition_map:\
\n ('a -> ('b, 'c) Either.t) -> 'a list -> 'b list * 'c list")))))
(warnings ())) |}] In what scenarios do you want to enforce an initial indent in a code block? |
From a usability point of view you turn the problem of understanding how stuff is going to be indented into checking
Who knows ? There's always the odd case out there and it's enraging the day you need it. But a simple example is ASCII art.
|
Your example would be correctly handled by finding the lowest indent ( Using lowest indent: let stdlib_either =
test "
{[
+---+---+---
tape | A | A | …
+---+---+---
]}
";
[%expect {|
((output
(((f.ml (2 0) (6 2))
(code_block
((f.ml (2 2) (6 0))
" +---+---+---\
\ntape | A | A | \226\128\166\
\n +---+---+--- ")))))
(warnings ())) |}] Using first line indent: let stdlib_either =
test "
{[
+---+---+---
tape | A | A | …
+---+---+---
]}
";
[%expect {|
((output
(((f.ml (2 0) (6 2))
(code_block
((f.ml (2 2) (6 0))
"+---+---+---\
\n| A | A | \226\128\166\
\n+---+---+--- ")))))
(warnings ())) |}] |
Right, now suppose you want to write:
|
I think in the end that this is actually the more natural and less confusing overall (and allows to specify any initial indent you wish without fuss). There is the question of what happens when you have:
but you can simply say that this is equivalent to:
So basically the first column of |
That's pretty clever. Thanks for thinking it through with us. |
Thanks for all the ideas and discussion. Although this PR is "just" a fix for an already established syntax, and not an attempt to change it, it's good to have feedback on it, and we can maybe change it. I'll try to sum up the choice we have, but first:
Mdx currently changes the code blocks. But you want it to change them while preserving the layout. For instance:
should be rewritten to
so that eg it does not fight with ocamlformat which is responsible for the layout. That is why we might need a bit more than the raw string, but a standardized way of going from/to raw content/deindented content. I'll try to sum up the pros and cons of each possibility, at this point: Here are the nice properties we would like to have:
We have several candidates for that:
B. Using the first indent
C. Using the opener's indent
So I'm fine with A. or C. as long as we find out it's not going to break to many code blocks! I'll try to investigate C's breaking. |
Btw. I'm just curious are verbatim blocks treated differently from code blocks ? Because with the odoc3. I'm currently forced to write e.g.:
To make sure the verbatim block is not indented. If I write it:
|
Note that only C allows you to write this example. The problem is not about being able to indent on the first line, it's to be able to indent the whole content as you wish in the code block. |
Yes, that's what I'm asking for. As an example where we might want the raw text rather than the processed stuff, we could do a better job of error messages than mdx does today by inserting line directives, so that the line numbers matched the source file correctly. However, I don't think there's such a thing as 'column directives', so to get the column numbers to match the source file correctly you could insert each line in exactly from the source, including the indentation. |
Some more though about C: (** I'm formatted by ocamlformat
{[
hello ocamlformat
]} *) renders as If we use the last character ( (** I'm formatted by ocamlformat
{@ocaml[
let x =
match w with
| AAAAAAAAAAAAAAAAAAAAAAAA -> 1
| BBBBBBBBBBBBBBBBBBBBBBBBB -> 2
]} *) will render as
We could use as indent the position of |
Yes that's what you need if you want to use line directives.
Well this could be changed in In any case I don't think it's a good idea to indent after |
Yes, what I'm saying is more: there are currently many projects using ocamlformat, with automatically indented code blocks. Changing the indentation to be based on the opening About horizontal space: I'm slightly concerned with
being rendered as
|
Out of the example given by @NoahTheDuke at #1317 (comment), I think that examples 2, 3, 5 and 6 will get their rendering wrong:
|
Note the whole code block semantics has already been through a few changes and it's perhaps a good idea to review these discussions to what got us to what we have now. Personally I don't think the right decision was taken at the time (it seems they did not realize that you could not enforce arbitrary layouts) but perhaps it's not worth changing again. However I have to say, as far as I'm concerned, the only reliable way of working with code block and verbatims in |
This PR is ready for review:
|
Is it really worth making an exception here ? It makes the construct more complex to understand and perhaps that some explanations can be written more compactly that way. For example:
Are you sure about that one ? In the past |
I'm a bit confused! The first part of the sentence seems to say it's not worth it, and the second part seem to say that it is worth it 🫤
So, if we trim trailing spaces in code blocks, which maybe we should do, there is no distinction in how multiline/single lines content is handled. But I would still prefer to explain how they are handled "case by case".
I |
No I think we misunderstood each other You said:
To me this means But then I just find that simply saying that:
is strictly equivalent to
and then explain how:
works is the easiest. You no longer need a notion of single or multi line code block. There are just code blocks.
Not at all. It was warnings about the way I wrote things in |
Indeed! What I had understood from this comment, was that:
was considered as:
but in fact, I now think you meant that it is turned into:
So in what I understood version, the beginning was trimmed anyway, which explains why I was confused why you were dissatisfied with trimming the beginning of single line content. Thinking about it, not trimming makes sense. I'll add a commit to this PR with this behavior. |
About the warning, you wrote:
So in some case, it appears OK to raise a warning if odoc has a wrong opinion on the layout of code blocks! That being said, I'm ok to remove this warning. |
No that's a different warning. It's a warning to help people getting rid of a semantics of code blocks we'd like to get rid of. I'm more talking about warnings/errors like this one which, while being well-meaning, end up being unhelpful for all sorts of reasons (there were other occurences but I fail to find them). Note that if by:
you meant no initial and trailing trimming that issue may actually be solved by this PR. |
In the last changes, I removed the trimming for more than a whitespace only line (as per this comment, which praises a behavior that was lost...) So now, the semantics of code block content is reasonably easy to explain, and it is possible to have code blocks with arbitrary whitespace in it. I hope the PR is in a final state in term of design! Footnotes
|
I'd still like to have the raw untrimmed code block in the AST to avoid the need to keep a copy of the original source around for MDX-like uses. Or perhaps we should have that in another PR? |
I started doing that in another branch! |
Co-authored-by: Daniel Bünzli <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just some comments to change!
else | ||
match s.[index] with | ||
| ' ' | '\t' | '\r' -> handle_first_newline (index + 1) | ||
(* Multiline starting with an empty line *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These comments are a little confusing, and could do with rewording.
Let's change the comments in your other PR based on this one, to avoid the rebase. |
TODO:
In my mental model, code blocks are supposed to follow a "box model", where the actual content of a code block is the smallest box that contains it all:
In the example above, the actual content of the code blocks is in lighter blue. As you can see, some whitespace has been stripped before, after, but also at the beginning of each line.
However, (supposing my mental model is the right one), there are two bugs in the implementation.
The first one is about lines with only whitespace, but less than the box:
Suppose that
_
are spaces (I used_
to make it visible). I think those spaces are not part of the box and should be stripped, but currently they are part of the content, which would be:The second one is about a corner case of the "box model": when the code block does not start with a new line.
In this case, it is clear what the content of the code block should be:
However, in this case it is not!
Of course,
{[
should not be part of "the box". So, what do we put instead?One possibility would be to fill it with spaces:
Another solution would be to keep the first line as it is, in this degenerated case:
The currently implemented solution is very confusing in my opinion. For the first line, it removes a number of spaces corresponding to the min between the de-indentation level and the number of opening spaces in the first line. In effect it ignores the offset in which the first line start.
The first solution seems the most consistent with the box model, but also probably the most "breaking change". The second solution seems good as:
is turned as:
and not
as it would in the first solution.
Which solution do you think is best?