- Writing a formatter has never been so easy: a Topiary tutorial
- Single-line and multi-line formatting with Topiary
In a previous post, I introduced Topiary, a universal formatter (or one could say a formatter generator), and showed how to start a formatter for a programming language from scratch. This post is the second part of the tutorial, where we’ll explore more advanced features of Topiary that come in handy when handling real-life languages, and in particular the single-line and multi-line layouts. I’ll assume that you have a working setup to format our toy Yolo language. If you don’t, please follow the relevant sections of the previous post first.
Single-line and multi-line
A fundamental tenet of formatting is that you want to lay code out in different ways depending on whether it fits on one line or not. For example, in Nickel, or any functional programming language for that matter, it’s idiomatic to write small anonymous functions on one line, as in std.array.map (fun x => x * 2 + 1) [1,2,3]. But longer functions would rather look like:
fun x y z =>
  if x then
    y
  else
    z
This is true for almost any language construct that you can think of: you’d write a small boolean condition as is_a && is_b, but write a long validation expression as:
std.is_string value
  && std.string.length value > 5
  && std.string.length value < 10
  && !(std.string.is_match "\\d" value)
In Rust, with rustfmt, short method calls are formatted on one line as in x.clone().unwrap().into(), but they are spread over several lines when the line length is over a fixed threshold:
value
    .maybe_do_something(|x| x+1)
    .or_something_else(|_| Err(()))
    .into_iter()
You usually either want the single-line layout or the multi-line one. A hybrid solution wouldn’t be very consistent:
std.is_string value
  && std.string.length value > 5 && std.string.length value < 10
  && !(std.string.is_match "\\d" value)
Some formatters, such as Rust’s, choose the layout automatically depending on the length of the line. Long lines are wrapped and laid out in the multi-line style automatically, freeing the programmer from any micro decision. On the flip side, the programmer can’t force one style in cases where it’d make more sense.
Some other formatters, like our own Ormolu for Haskell, decide on the layout based on the original source code. For any syntactic construct, the programmer has two options:
- Write it on one line, or
- Write it on two lines or more.
The first option triggers the single-line layout, and the second the multi-line one. No effort is made to try to fit within reasonable line lengths. That’s up to the programmer.
As we will see, Topiary follows the same approach as Ormolu, although future support for optional line wrapping isn’t off the table.
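To make the difference between the two strategies concrete, here is a small Python sketch. It is a toy model and not how rustfmt, Ormolu or Topiary are implemented: the function names and the max_width default of 40 are made up for illustration.

```python
def layout_from_source(construct_source: str) -> str:
    """Source-driven choice: keep the layout the programmer wrote."""
    return "multi-line" if "\n" in construct_source.strip() else "single-line"


def layout_from_width(construct_source: str, max_width: int = 40) -> str:
    """Width-driven choice: wrap only when the one-line form is too long."""
    one_line = " ".join(construct_source.split())
    return "multi-line" if len(one_line) > max_width else "single-line"


condition = "is_a\n  && is_b"
print(layout_from_source(condition))  # multi-line: the programmer broke the line
print(layout_from_width(condition))   # single-line: it fits within max_width
```

The source-driven function never looks at the line length, which is why keeping lines reasonably short stays the programmer's job.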
Softlines
Fewer line breaks, please
Let’s see how our Yolo formatter handles the following source:
input income, status
output income_tax
income_tax := case { status = "exempted" => 0, _ => income * 0.2 }
Since the case is short, we want to keep it single-line. Alas, this gets formatted as:
input income, status
output income_tax
income_tax := case {
  status = "exempted" => 0,
  _ => income * 0.2
}
The simplest mechanism for multi-line-aware layout is to use softlines instead of spaces or hardlines. Let’s change the @append_hardline capture in the rule that separates case branches to @append_spaced_softline:
; Put case branches on their own lines
(case
  "," @append_spaced_softline
)
As the name indicates, a spaced softline will result in a space for the single-line case, and a line break for the multi-line case, which is precisely what we want. However, if we try to format our example, we get the dreaded idempotency check failure, meaning that formatting once or twice in a row doesn’t give the same result, which is usually a red flag (and is why Topiary performs this check). What happens is that our braces { and } also introduce hardlines, so the double formatting goes like:
income_tax := case { status = "exempted" => 0, _ => income * 0.2 }
--> (case is single-line: @append_spaced_softline is a space)
income_tax := case {
  status = "exempted" => 0, _ => income * 0.2
}
--> (case is multi-line! @append_spaced_softline is a line break)
income_tax := case {
  status = "exempted" => 0,
  _ => income * 0.2
}
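The failure can be reproduced with a toy model in Python. This is only a sketch under simplifying assumptions, not Topiary's code: format_case is a hypothetical function that hard-codes the two rules at play (hardlines around the braces, a spaced softline after each comma) and splits branches naively on commas.

```python
def format_case(source: str) -> str:
    """Toy formatter: hardlines around braces, spaced softline after commas."""
    head, rest = source.split("{", 1)
    body = rest.rsplit("}", 1)[0]
    # The softline is resolved from the layout of the case node in the *input*.
    case_is_multi_line = "\n" in source[source.index("case"):].strip()
    separator = ",\n  " if case_is_multi_line else ", "
    branches = [branch.strip() for branch in body.split(",")]
    # The braces always emit line breaks, whatever the input looked like.
    return head.rstrip() + " {\n  " + separator.join(branches) + "\n}"


source = 'income_tax := case { status = "exempted" => 0, _ => income * 0.2 }'
once = format_case(source)
twice = format_case(once)

print(once == twice)                # False: the idempotency check fails
print(format_case(twice) == twice)  # True: only the second pass is stable
```

The first pass sees a single-line case but outputs a multi-line one, so the second pass resolves the same softline differently.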
We need to amend the rule for braces as well:
; Lay out the case skeleton
(case
  "{" @prepend_space @append_spaced_softline
  "}" @prepend_spaced_softline
)
Our original example is now left untouched, as desired. Note that softline annotations are expanded depending on the multi-lineness of the direct parent of the node they attach to (and neither on the subtree matched by the whole query nor on the node itself). Topiary applies this logic because this is most often what you want. The parse tree of the multi-line version of income_tax:
income_tax := case {
  status = "exempted" => 0,
  _ => income * 0.2
}
is as follows (hiding irrelevant parts in [...]):
0:0 - 4:0 tax_rule
  0:0 - 3:1 statement
    0:0 - 3:1 definition_statement
      0:0 - 0:10 identifier `income_tax`
      0:11 - 0:13 ":="
      0:14 - 3:1 expression
        0:14 - 3:1 case
          0:14 - 0:18 "case"
          0:19 - 0:20 "{"
          1:2 - 1:26 case_branch
            [...]
          1:26 - 1:27 ","
          2:2 - 2:19 case_branch
            [...]
          3:0 - 3:1 "}"
The left part is the span of the node, in the format start_line:start_column - end_line:end_column. A node is multi-line simply if end_line > start_line. You can see that since "{" is not multi-line (it can’t be, as it’s only one character!), if Topiary considered the multi-lineness of the node itself, our previous "{" @append_spaced_softline would always act as a space.
What happens is that Topiary considers the direct parent instead, which is 0:14 - 3:1 case here, and is indeed multi-line.
Both single-line and multi-line case are now formatted as expected.
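The parent-based rule can be sketched in a few lines of Python. This is an illustrative model only: the Node class and the spaced_softline function are hypothetical names, not part of Topiary, and the spans are the ones from the parse tree shown earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    kind: str
    start_line: int
    end_line: int
    parent: Optional["Node"] = None

    def is_multi_line(self) -> bool:
        return self.end_line > self.start_line


def spaced_softline(node: Node) -> str:
    """Resolve a spaced softline attached to `node` from its direct parent."""
    parent = node.parent
    return "\n" if parent is not None and parent.is_multi_line() else " "


case = Node("case", start_line=0, end_line=3)
open_brace = Node('"{"', start_line=0, end_line=0, parent=case)

print(open_brace.is_multi_line())         # False: a single character
print(repr(spaced_softline(open_brace)))  # '\n': the parent case spans lines 0 to 3
```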
More line breaks, please
Let’s consider the dual issue, where line breaks are unduly removed. We’d like to allow inputs and outputs to span multiple lines, but the following snippet:
input
  income,
  status,
  tax_coefficient
output income_tax
is formatted as:
input income, status, tax_coefficient
output income_tax
The rule for spacing around input and output and the rule for spacing around , and identifiers both use @append_space. We can simply replace this with a spaced softline. Recall that a spaced softline turns into a space and thus behaves like @append_space in a single-line context, making it a proper substitution.
; Add spaced softline after `input` and `output` decl
[
  "input"
  "output"
] @append_spaced_softline
; Add a spaced softline after and remove space before the comma in an identifier
; list
(
  (identifier)
  .
  "," @prepend_antispace @append_spaced_softline
  .
  (identifier)
)
We also need to add new rules to indent multi-line lists of inputs or outputs.
; Indent multi-line lists of inputs.
(input_statement
  "input" @append_indent_start
) @append_indent_end
; Indent multi-line lists of outputs.
(output_statement
  "output" @append_indent_start
) @append_indent_end
A matching pair of indentation captures *_indent_start and *_indent_end will amount to a no-op if they are on the same line, so those rules don’t disturb the single-line layout.
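Why the pair is a no-op on a single line can be seen with a toy renderer in Python. This is a sketch under assumptions, not Topiary's internals: the INDENT_START and INDENT_END markers, render and input_statement are hypothetical names, and indentation is only materialised at the start of a new line.

```python
def render(atoms: list, indent_unit: str = "  ") -> str:
    """Concatenate atoms, inserting indentation only after a line break."""
    out, level, at_line_start = [], 0, False
    for atom in atoms:
        if atom == "INDENT_START":
            level += 1
        elif atom == "INDENT_END":
            level -= 1
        elif atom == "\n":
            out.append("\n")
            at_line_start = True
        else:
            if at_line_start:
                out.append(indent_unit * level)
                at_line_start = False
            out.append(atom)
    return "".join(out)


def input_statement(names: list, softline: str) -> list:
    """Atoms for an input statement; `softline` is already resolved."""
    atoms = ["input", "INDENT_START", softline]
    for i, name in enumerate(names):
        atoms.append(name)
        if i < len(names) - 1:
            atoms += [",", softline]
    return atoms + ["INDENT_END"]


print(render(input_statement(["income", "status"], " ")))
print(render(input_statement(["income", "status"], "\n")))
```

With a space as the softline, no line break occurs between the two markers, so the indentation level changes without ever being printed.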
Recall that as long as you don’t use anchors (.), additional nodes can be omitted from a Tree-sitter query: here, the first query will match an input statement with an "input" child somewhere, and any children before or after that (although in our case, there won’t be any children before).
Scopes
More (scoped) line breaks, please
Let us now consider a similar example, at least on the surface. We want to allow long arithmetic expressions to be laid out on multiple lines as well, as in:
input
  some_long_name,
  other_long_name,
  and_another_one
output result
result :=
  some_long_name
  + other_long_name
  + and_another_one
As before, result is currently smashed back onto one line by our current formatter. This is unsurprising, since our keywords rule uses @prepend_space and @append_space. At this point, you start to get the trick: let’s use softlines! I’ll only handle + for simplicity. We remove "+" from the original keywords rule and add the following rule:
; (Multi-line) spacing around +
("+" @prepend_spaced_softline @append_space)
Ignoring indentation for now, the line wrapping seems to work. For the following example at least:
result :=
some_long_name
+ other_long_name + and_another_one
which is reformatted as:
result := some_long_name
+ other_long_name
+ and_another_one
However, perhaps surprisingly, the following example:
result :=
some_long_name + other_long_name
+ and_another_one
is reformatted as:
result := some_long_name + other_long_name
+ and_another_one
The first addition hasn’t been split! To understand why, we have to look at how our grammar parses arithmetic expressions:
expression: $ => choice(
  $.identifier,
  $.number,
  $.string,
  $.arithmetic_expr,
  $.case,
),
arithmetic_expr: $ => choice(
  prec.left(1, seq(
    $.expression,
    choice('+', '-'),
    $.expression,
  )),
  prec.left(2, seq(
    $.expression,
    choice('*', '/'),
    $.expression,
  )),
  prec(3, seq(
    '(',
    $.expression,
    ')',
  )),
),
Even if you don’t understand everything, there are two important points:
- Arithmetic expressions are recursively nested. Indeed, we can compose arbitrarily complex expressions, as in (foo*2 + 1) + (bar / 4 * 6).
- They are parsed in a left-associative way.
This means that our big addition is parsed as: ((some_long_name "+" other_long_name) "+" and_another_one). In the first example, since the line break happens just after some_long_name in the original source, both the inner node and the outer one are multi-line. However, in the second example, the line break happens after other_long_name, meaning that the innermost arithmetic expression is contained in a single line, and the corresponding + isn’t considered multi-line. Indeed, you can see here that the parent of the first + is 7:0 - 7:32 arithmetic_expr, which fits entirely on line 7.
7:0 - 8:17 arithmetic_expr
  7:0 - 7:32 expression
    7:0 - 7:32 arithmetic_expr
      7:0 - 7:14 expression
        7:0 - 7:14 identifier `some_long_name`
      7:15 - 7:16 "+"
      7:17 - 7:32 expression
        7:17 - 7:32 identifier `other_long_name`
  8:0 - 8:1 "+"
  8:2 - 8:17 expression
    8:2 - 8:17 identifier `and_another_one`
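The same reasoning can be replayed mechanically. The Python sketch below is a simplified model, not Topiary's implementation: Node and resolve_plus_softlines are hypothetical names, and only the line numbers of the spans in the tree above are kept.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str
    start_line: int
    end_line: int
    children: list = field(default_factory=list)

    def is_multi_line(self) -> bool:
        return self.end_line > self.start_line


def resolve_plus_softlines(node: Node) -> list:
    """Return (line, expansion) for every "+", decided by its direct parent."""
    resolved = []
    for child in node.children:
        if child.kind == "+":
            expansion = "line break" if node.is_multi_line() else "space"
            resolved.append((child.start_line, expansion))
        resolved.extend(resolve_plus_softlines(child))
    return resolved


# ((some_long_name + other_long_name) + and_another_one)
inner = Node("arithmetic_expr", 7, 7, [
    Node("expression", 7, 7), Node("+", 7, 7), Node("expression", 7, 7),
])
outer = Node("arithmetic_expr", 7, 8, [
    Node("expression", 7, 7, [inner]), Node("+", 8, 8), Node("expression", 8, 8),
])

print(resolve_plus_softlines(outer))
# [(7, 'space'), (8, 'line break')]
```

The first + stays on its line because its parent, the inner arithmetic_expr, starts and ends on line 7.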
The solution here is to use scopes. A scope is a user-defined group of nodes associated with an identifier. Crucially, when using scoped softline captures such as @append_scoped_spaced_softline within a scope, Topiary will consider the multi-lineness of the whole scope instead of the multi-lineness of the (parent) node.
Let’s create a scope for all the nested sub-expressions of an arithmetic expression. Scopes work the same as other node groups in Topiary: we create them by using a matching pair of begin and end captures. We need to find a parent node that can’t occur recursively in an arithmetic expression. A good candidate would be definition_statement, which encompasses the whole right-hand side of the definition of an output:
; Creates a scope for the whole right-hand side of a definition statement
(definition_statement
  (#scope_id! "definition_rhs")
  ":="
  (expression) @prepend_begin_scope @append_end_scope
)
We must specify an identifier for the scope using the predicate scope_id. Identifiers are useful when several scopes might be nested or even overlap, and help readability in general.
We then amend our initial attempt at formatting multi-line arithmetic expressions:
; (Multi-line) spacing around +
(
  (#scope_id! "definition_rhs")
  "+" @prepend_scoped_spaced_softline @append_space
)
We use a scoped version of softlines, in which case we need to specify the identifier of the corresponding scope. The captured node must also be part of said scope. You can check that both examples (and multiple variations of them) are finally formatted as expected.
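As a rough mental model, the difference between a plain and a scoped softline boils down to which span is inspected. The Python sketch below is illustrative only: Span, spaced_softline and scoped_spaced_softline are hypothetical helpers, and the scopes dictionary stands in for however Topiary tracks scopes internally. The line numbers are those of the second example.

```python
from dataclasses import dataclass


@dataclass
class Span:
    start_line: int
    end_line: int

    def is_multi_line(self) -> bool:
        return self.end_line > self.start_line


def spaced_softline(parent: Span) -> str:
    """Plain softline: decided by the direct parent of the captured node."""
    return "line break" if parent.is_multi_line() else "space"


def scoped_spaced_softline(scope_id: str, scopes: dict) -> str:
    """Scoped softline: decided by the span of the scope with that identifier."""
    return "line break" if scopes[scope_id].is_multi_line() else "space"


scopes = {"definition_rhs": Span(7, 8)}  # whole right-hand side of the definition
inner_arithmetic_expr = Span(7, 7)       # direct parent of the first "+"

print(spaced_softline(inner_arithmetic_expr))            # space
print(scoped_spaced_softline("definition_rhs", scopes))  # line break
```

Every + inside the scope now gets the same answer, which is what makes the layout consistent.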
Conclusion
This second part of the Topiary tutorial has shown how to finely specify an alternative formatting layout depending on whether an expression spans multiple lines or not. The main concepts at play here are multi-line versus single-line nodes, and scopes. There is an extension to this concept not covered here, measuring scopes, but standard scopes already go a long way for formatting a real-life language. If you’re looking for a comprehensive resource to help you write your formatter, the official Topiary book is for you. You can also find the complete code for this post in the companion repository. Happy hacking!
Behind the scenes
Yann is the head of the Programming Languages & Compiler group at Tweag. He's also leading the development of the Nickel programming language, a next-generation typed configuration language designed to manage the growing complexity of Infrastructure-as-Code and a candidate successor for the Nix language. You might also find him doing Nix or any other trickery to fight against non-reproducible and slow builds or CI.