Tweag

Writing a formatter has never been so easy: a Topiary tutorial

30 January 2025 — by Yann Hamdaoui

A bit more than one year ago, Tweag announced our open-source, universal formatting engine Topiary, based on the tree-sitter ecosystem. Since then, Topiary has been serving as the official formatter (under the hood) for the Nickel configuration language. Topiary also supports a bunch of other languages (CSS, TOML, OCaml, Bash) and we are seeing people trying it out to support even more languages such as Catala, Nushell, Nix, and more. While I’ve kind of been part of the project from a distance, I’m first and foremost a happy user of Topiary, which I genuinely find really cool both conceptually and practically. While the technical documentation provides an extensive description of Topiary’s capabilities, it doesn’t include (as of now) a complete step-by-step guide on how to write a new formatter for your own language starting from zero. In this post, I’ll show you precisely how to do that.

Why you should use Topiary

Let’s say that you’ve authored a great payroll management application and created a new niche programming language named Yolo to describe tax logic for different countries (tax calculation is anything but a trivial subject!). Developers these days aren’t satisfied with an obscure command-line interpreter anymore. They expect beautiful colors, they expect auto-completion, they expect automatic and uniform formatting, they expect package management and a package registry to distribute their code!

While some of those features are just too much work for a niche language, formatting does sound like a basic commodity that you could provide. Alas, this is only true on the surface. At a high level, a formatter performs the following steps:

  1. Parse the input to a structured representation
  2. Pretty-print the result while respecting parts of the original layout (comments, some line breaks, etc.)

Sometimes you can reuse the parser and the representation of your language implementation, but it’s not guaranteed, as parsing for formatting, for interpretation, and for compilation have different requirements. If you’ve ever written a serious pretty-printer, with indentation, single-line versus multi-line layout, line-wrapping and all, you’ll know that it’s also not as simple as it looks. For a serious formatter, you’ll need to search for a variety of patterns and treat them in a specific way.

The worst part about all of this is that many of these tasks are generic (not language specific) and laborious, but we still need to reimplement them for every formatter under the sun. It’s frustrating!

This is where Topiary comes in. Topiary is a generic formatter that leverages tree-sitter, an incremental parsing framework. Chances are your language already has a tree-sitter grammar, or it probably should, if you want basic editor support such as syntax highlighting. Given a tree-sitter grammar definition for a language, Topiary will handle parsing and pretty-printing automatically for you. What’s left to do is to use Topiary’s declarative language to write formatting rules. You can focus on the actual logic of the formatter and delegate the boring stuff to Topiary.

As a teaser, beyond the initial setup, you’ll only need to write rules that look like this somewhere in a file:

; Add indentation to the condition of pattern guards in a match branch
(match_branch
  (pattern_guard
    "if" @append_indent_start
    (term) @append_indent_end
  )
)

And you’ll get a formatter! Neat, isn’t it?

There is one caveat: Topiary doesn’t plan to officially support formatting whitespace-sensitive languages, such as Python or Haskell. Depending on the language, it might or might not be doable, but it is likely to be troublesome.

Writing a formatter for Yolo

A Yolo file defines inputs and outputs for a tax calculation using the eponymous keywords:

input income, status
output net_income, income_tax

The rest of the file defines the outputs as functions of the inputs and other outputs. They can be either simple arithmetic formulas, or they can be defined by case analysis with basic support for boolean conditions:

income_tax := case {
  status = "exempted" | income < 10000 => 0,
  _ => income * 0.2
}

net_income := income - income_tax

Step 1: the tree-sitter grammar

This tutorial isn’t about writing a tree-sitter grammar, but since it’s a requirement for Topiary and I want this post to be exhaustive, I can’t just leave this part out. I’ll quickly cover how to spin up a tree-sitter grammar for a language and how to understand tree-sitter output.

Setup

You’ll need to install a recent version of the tree-sitter CLI (tested with 0.24). I’ll use Nix to install it, but other installation methods are documented in the tree-sitter documentation.

$ nix profile install nixpkgs#tree-sitter
$ mkdir tree-sitter-yolo
$ cd tree-sitter-yolo
$ tree-sitter init
[.. prompts from tree-sitter to init your repo ..]

tree-sitter init generates a bunch of files, but the one we care about is grammar.js. This is a grammar definition of your language in JavaScript. I won’t go into the details of tree-sitter grammar development but instead just provide a simple definition for our toy language Yolo.

Here is a simple tree-sitter grammar for Yolo. Even if you don’t know JavaScript or tree-sitter very well, it should be reasonably readable.

Then, we need to ask tree-sitter to generate the parser source files for Yolo and build the parser:

tree-sitter generate
tree-sitter build

If everything went well, you should have a file yolo.so at the root of your grammar directory.

The grammar

The grammar defines the shape of the tree that tree-sitter will produce and that your formatter will manipulate. You might need to refine the grammar later to support finer formatting rules.

What’s important to understand is how a parse tree is represented. Let’s take the original Yolo example in full and put it in a test.yolo file:

input income, status
output net_income, income_tax

income_tax := case {
  status = "exempted" | income < 10000 => 0,
  _ => income * 0.2
}

net_income := income - income_tax

tree-sitter will parse it to a tree that looks like this [1] (some subtrees have been collapsed for brevity):

[Figure: parse tree of test.yolo, with some subtrees collapsed]

Images aren’t really suitable for interaction and automation, though. Fortunately, tree-sitter uses a syntax called S-expressions to represent and manipulate such trees as text. You can ask tree-sitter to print the text representation:

tree-sitter parse test.yolo --no-ranges

The full output is a bit verbose, but very instructive. Let’s take a quick look at it. I’ve added the corresponding source next to each node as a ;-delimited comment for clarity. The nesting structure is given by the parentheses, which introduce a new node starting with a name and followed by the node’s children.

(tax_rule
  (statement
    (input_statement              ; input income, status
      (identifier)                ; income
      (identifier)))              ; status
  (statement
    (output_statement             ; output net_income, income_tax
      (identifier)                ; net_income
      (identifier)))              ; income_tax
  (statement
    (definition_statement         ; income_tax := case { ... }
      (identifier)                ; income_tax
      (expression
        (case                     ; case { ... }
          (case_branch            ; status = "exempted" | income < 10000 => 0
            condition: (condition ; status = "exempted" | income < 10000
              (condition          ; status = "exempted"
                (identifier)      ; status
                (expression
                  (string)))      ; "exempted"
              [..])               ; | income < 10000
            body: (expression
              (number)))          ; 0
          (case_branch            ; _ => income * 0.2
           [..])
  [..]

You can take another look at the image above and try to match each node with a line in the S-expression (beware that I didn’t collapse exactly the same parts in the S-expression and in the image). We can see labels such as condition: and body: which we have introduced in the grammar using the tree-sitter field() helper, to make things easier to read and to use.

Some nodes seem to be missing from the S-expression: where are the operators or keywords such as |, :=, or case? Those are unnamed nodes in the tree-sitter jargon, which are hidden by default in the S-expression representation — but they are there in the tree nonetheless.

Step 2: the Topiary setup

Let’s now install Topiary and extend it with our grammar. Since Topiary 0.5, we no longer need to touch the source code or rebuild it to add a custom language. Instead, we can configure it.

First, install Topiary version 0.5.1 or higher. I will once again use Nix magic [2], but the Topiary repository comes with pre-built binaries and other installation methods.

nix profile install github:tweag/topiary

Then, write the following Nickel configuration file in your grammar repository:

# topiary-yolo.ncl
{
  languages = {
    yolo = {
      extensions = ["yolo"],
      grammar.source.path = "/path/to/tree-sitter-yolo/yolo.so",
    }
  }
}

This defines the file extensions for yolo and the path to the compiled grammar [3]. If one day the grammar is published to a git repository, you can specify a git repository and a revision instead. See Topiary’s documentation for more information.

The last ingredient is the query file, which contains the formatting rules. We’ll start with an empty one:

mkdir -p ~/.config/topiary/queries
touch ~/.config/topiary/queries/yolo.scm

Using TOPIARY_LANGUAGE_DIR to point Topiary to our extra query directory, we can now try to format our program. Topiary formats in-place by default, but for now we use shell redirections to avoid mutating the original file:

$ export TOPIARY_LANGUAGE_DIR=~/.config/topiary/queries
$ topiary format --configuration topiary-yolo.ncl --skip-idempotence --language yolo < test.yolo
inputincome,statusoutputnet_income,income_taxincome_tax:=case{status="exempted"|income<10000=>0,_=>income*0.2}net_income:=income-income_tax

Well, that’s not exactly what we expected, but something happened! Because our query file is empty and Topiary considers languages to be whitespace-insensitive by default, all spaces have been eaten up (--skip-idempotence disables a sanity check that would have rejected the output).

We can finally start to write the meat of our Yolo formatter to fix this!

Step 3: the queries

Queries are patterns that match subtrees of the input. A query is decorated with captures, which are attributes that are attached to matched nodes (prefixed with the @ sign). When a query matches, the tree is decorated with the corresponding captures. For tree-sitter, captures are generic extra annotations, but Topiary interprets them to format the output as desired.

I encourage you to read the reference documentation on tree-sitter queries at some point. Topiary’s README lists all captures that you can use with Topiary. Comments are introduced with a leading ; in the query file.

In the following, the code snippets are to be appended to the query file ~/.config/topiary/queries/yolo.scm. First, we’ll tell Topiary to ensure some spacing around operators:

; Do not mess with spaces within strings
(string) @leaf

; Do not remove empty lines between statements, for readability and space
(statement) @allow_blank_line_before

; Always surround operators with spaces
[
  "="
  ">"
  "<"
  "&"
  "|"
  "_"
  "=>"
  "+"
  "-"
  "*"
  ":="
] @prepend_space @append_space

Those queries will match the corresponding nodes wherever they appear in the tree. Now, let’s stipulate that each statement must be separated by at least a new line:
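To make "wherever they appear in the tree" concrete, here is a small sketch of the opposite: restricting a pattern to one context by nesting it inside a parent node. It assumes, based on the parse tree from step 1, that "=" is a direct child of condition and that case_branch carries the body: field. Both rules are redundant with the ones above and purely illustrative, so there is no need to add them to yolo.scm.

```scheme
; Illustration only: redundant with the operator list above.
; Unlike a bare "=", this pattern matches "=" only when it is a direct
; child of a condition node (assumed from the parse tree in step 1).
(condition
  "=" @prepend_space @append_space
)

; Field names from the grammar can pin down a specific child: this
; matches only the expression labelled body: inside a case_branch.
(case_branch
  body: (expression) @prepend_space
)
```

Named nodes are written in parentheses, while unnamed nodes such as operators and keywords are written as their literal text in double quotes.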

; Add a newline between two consecutive statements
(
  (statement) @append_hardline
  .
  (statement)
)

We’ve used a tree-sitter anchor ., which ensures that this pattern matches two consecutive statements with nothing in between (except maybe unnamed nodes), so that we don’t add a new line before the first one or after the last one, but only between each consecutive pair. Topiary won’t add a second new line if the source already has one: existing spacing is mostly forgotten (except when using @allow_blank_line_before, @append_input_softline or @prepend_input_softline) while query-introduced spacing, including whitespace and line breaks, is accumulated and flattened. For example, if two different queries append a space after a node, the final result will still be that only one space is appended.
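As a sketch of that flattening, consider a second rule that captures a token already covered by the operator list. It is redundant and only here for illustration; it assumes, from the parse tree in step 1, that ":=" is a direct child of definition_statement.

```scheme
; Illustration only: the operator list above already attaches
; @append_space to every ":=" token.
; This rule attaches a second @append_space to the same node.
(definition_statement
  ":=" @append_space
)
```

With both rules in place, net_income := income - income_tax should still come out with a single space after :=, since the two appended spaces are flattened into one.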

If you look back at the output of tree-sitter parse in step 1, the statement nodes have more content (a single child and many grandchildren) than the query suggests. That’s fine: tree-sitter queries let you omit irrelevant siblings and children by default.
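The following sketch contrasts the short form with a more explicit one. The second pattern is a hypothetical variant, shown only to illustrate how spelling out a child narrows the match; it is not part of the Yolo formatter.

```scheme
; Short form, as used above: the children of statement are left out,
; so this matches input, output and definition statements alike.
(statement) @allow_blank_line_before

; Hypothetical narrower variant: spelling out one child restricts the
; match to statements that wrap a definition_statement.
(statement
  (definition_statement)
) @allow_blank_line_before
```

In both patterns the capture sits after the closing parenthesis of statement, so it is attached to the statement node itself rather than to any of its children.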

Let’s format the case branches now. We want to put the initial case { on the same line, then each branch indented and on its own line, and finally the closing } alone on its line.

; Lay out the case skeleton
(case
  "{" @append_hardline @append_indent_start
  "}" @prepend_indent_end
)

; Put case branches on their own lines
(case
  (case_branch) @append_hardline
)

Again, because extra children and siblings can appear in the matched subtree by default, the second query will match each branch of each case expression once, and not only a case expression with a single branch.

It looks like we could merge those two queries since they both control how the case is formatted. However, getting the combined query right is much harder than just concatenating both, if it’s possible at all. In general, it’s both simpler and better to split your queries into small and topically coherent atoms, even if they apply to the same top-level node.

Let’s try to format a mangled version of our original Yolo file:

input income,
status output net_income, income_tax

income_tax := case { status="exempted"  | income<10000 => 0, _ => income*0.2}


net_income := income -    income_tax

$ topiary format --configuration topiary-yolo.ncl --skip-idempotence --language yolo < mangled.yolo
inputincome,status
outputnet_income,income_tax

income_tax := case{
  status = "exempted" | income < 10000 => 0
  , _ => income * 0.2
}

net_income := income - income_tax

Better, but we have some troubleshooting to do.

First, spaces are missing between input or output and the list of identifiers.

Second, we’d like to add a space after the comma and make sure there’s no space before the comma: input income, status. We also want a space between case and the following {.

Finally, the comma following a case branch is wrongly laid out on the next line. We are impacted by the way we wrote our grammar here: the comma is actually grouped with the next branch in the grammar as repeat(seq(",", $.case_branch)). We could either change the grammar or adapt the query. We choose the latter for simplicity.

Here’s the diff of the fix:

--- a/yolo.scm
+++ b/yolo.scm
@@ -19,6 +19,21 @@
   ":="
 ] @prepend_space @append_space

+; Add space after `input` and `output` decl
+[
+  "input"
+  "output"
+] @append_space
+
+; Add a space after and remove space before the comma in an identifier list
+(
+ (identifier)
+ .
+ "," @prepend_antispace @append_space
+ .
+ (identifier)
+)
+
 ; Add a newline between two consecutive statements
 (
   (statement) @append_hardline
@@ -28,11 +43,17 @@

 ; Lay out the case skeleton
 (case
-  "{" @append_hardline @append_indent_start
+  "{" @prepend_space @append_hardline
+  "}" @prepend_hardline
+)
+
+; Indent the content of case
+(case
+  "{" @append_indent_start
   "}" @prepend_indent_end
 )

 ; Put case branches on their own lines
 (case
-  (case_branch) @append_hardline
+  "," @append_hardline
 )

Now, we can try to format the mangled Yolo file again. We finally get rid of --skip-idempotence as we now output valid Yolo, and can format in-place.

$ topiary format --configuration topiary-yolo.ncl mangled.yolo
$ cat mangled.yolo
input income, status
output net_income, income_tax

income_tax := case {
  status = "exempted" | income < 10000 => 0,
  _ => income * 0.2
}

net_income := income - income_tax

And voilà!

Conclusion

In this post, we’ve seen how to set up a formatter for a new language using Topiary from scratch, creating a tree-sitter grammar, configuring Topiary, and writing our formatting rules. I hope that it’s a convincing demonstration that writing a code formatter has never been easier than today thanks to Topiary. Our formatter is simple but honest. In a follow-up post, I’ll cover more advanced features, such as multi-line versus single-line formatting, measuring scopes, comments, and more. Stay tuned!


  1. You can refer to Topiary’s documentation to learn how to generate those graphs.
  2. Although the Nix way is the easiest, the installation can take some time. Don’t panic if Nix doesn’t show any output for a while. Also note that we don’t install from nixpkgs but directly from the GitHub repository: nixpkgs doesn’t have the latest Topiary version yet.
  3. At the time of writing, using grammar.source.path unfortunately doesn’t work on Windows. You can still use the git revision style to point to your local tree-sitter-yolo repo, see Topiary documentation.

About the author

Yann Hamdaoui

Yann is the head of the Programming Languages & Compiler group at Tweag. He's also leading the development of the Nickel programming language, a next-generation typed configuration language designed to manage the growing complexity of Infrastructure-as-Code and a candidate successor for the Nix language. You might also find him doing Nix or any other trickery to fight against non-reproducible and slow builds or CI.

If you enjoyed this article, you might be interested in joining the Tweag team.

This article is licensed under a Creative Commons Attribution 4.0 International license.


© 2024 Modus Create, LLC
