Bashfulness

13 February 2025 — by Christopher Harrison

When I first joined the Topiary Team, I floated the idea of trying to format Bash with Topiary. While this did nothing to appease my unenviable epithet of “the Bash guy,” it was our first foray into expanding Topiary’s support beyond OCaml and simple syntaxes like JSON.

Alas, at the time, the Tree-sitter Bash grammar was not without its problems. I got quite a long way, despite this, but there were too many things that didn’t work properly for us to graduate Bash to a supported language.

Fast-forward two years and both Topiary and the Tree-sitter Bash grammar have moved on. As the incumbent Bash grammar was beginning to cause downstream problems from bit rot — frustratingly breaking the builds of both Topiary and Nickel — my fellow Topiarist, Nicolas Bacquey, migrated Topiary to the latest version of the Bash grammar and updated our Bash formatting queries to match.

With surprisingly little effort, Nicolas was able to resolve all those outstanding problems. So with that, Bash was elevated to the lofty heights of “supported language” and — with the changes I’ve made from researching this blog post — Bash formatting is now in pretty good shape in Topiary v0.6.

So much so, in fact, let me put my money where my mouth is! Let’s see how Topiary fares against a rival formatter. I’ll do this, first, by taking you down some of the darker alleys of Bash parsing, just to show you what we’re up against.

Hello darkness, my old friend

There is a fifth dimension beyond that which is known to man. It is a dimension as vast as space and as timeless as infinity. It is the middle ground between light and shadow, between science and superstition; it lies between the pit of man’s fears and the summit of his knowledge. This is the dimension of imagination. It is an area we call: the Bash grammar.

In our relentless hubris, man has built a rocket that — rather than exploding on contact with reality — dynamically twists and turns to meet reality’s expectations. Is that a binary? Execute it! Is that a built-in? Execute it! Is that three raccoons in a trench coat, masquerading as a function? Execute it! And so, with each token parsed, we are Bourne Again and stray ever further from god.

Bear witness to but a few eldritch horrors:¹

Trailing comments must be preceded by whitespace or a semicolon. However, if either of those is escaped, it is interpreted as a literal and this changes the tokenisation semantics:
```
echo \ # Ceci n'est pas
 | une pipe'
```
Here, perhaps the writer intended to add a comment against the first line. But, what looks like a comment isn’t a comment at all; it becomes an argument to echo, along with everything that follows. That includes the apostrophe in “n’est”, which is interpreted as an opening quote — a raw string — which is closed at the end of the next line.
Case statements idiomatically delimit each branch condition with a closing parenthesis. In a subshell, for example, this leads to unbalanced brackets:
```
( case $x in foo )   # Wat?...
echo bar;; esac )    # 🤯
```
This subshell outputs bar when the variable $x is equal to foo. Whereas, on a more casual reading, this formulation might just look like a confusing syntax error.

Speaking of case statements, did you know that ;& and ;;& are also valid branch terminators? Without checking the manual — if you can find the single paragraph where it’s mentioned — can you tell me how they differ?
Bash will try to compute an array index if it looks like an arithmetic expression:
```
# Output the (foo - bar)th element of array
echo "${array[foo-bar]}"
```
However, if array in this example is an associative array (i.e., a hash map/dictionary), then foo-bar could be a valid key. In which case, it’s not evaluated and used verbatim.
Without backtracking, it’s not possible to distinguish between an arithmetic expansion and a command substitution containing a subshell at its beginning or end:
```
echo $((foo + bar))
echo $((foo); (bar))
```
Here, the first statement will output the value of the addition of those two variables; the second will execute foo then bar, each in a subshell, echoing their output. In the subshell case, the POSIX standards even recommend that you add spaces — e.g., $( (foo) ) — to remove this ambiguity.
Heredocs effectively switch the parser into a different state, where everything is interpreted literally except when it isn’t. This alone is tricky, but Bash introduces some variant forms that allow additional indentation (with hard tabs), switching off all string interpolation, or both.
```
# Indented, with interpolation
cat <<-HEREDOC
	I am a heredoc. Hear me roar.
	HEREDOC
```
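Two of these can be checked directly in a shell. The following is a minimal sketch, assuming Bash 4 or later (which introduced the `;&` and `;;&` terminators and associative arrays); the `classify` function and the array contents are made up for illustration. It shows that `;&` falls through into the next branch body without testing its pattern, that `;;&` carries on testing the remaining patterns, and that the same `foo-bar` subscript is evaluated arithmetically for an indexed array but used verbatim for an associative one.

```shell
#!/usr/bin/env bash

# classify WORD: report which branches of the case statement ran (illustrative)
classify() {
  local hits=()
  case "$1" in
    a*) hits+=(starts-with-a) ;;&  # ;;& keeps testing the patterns below
    *z) hits+=(ends-with-z) ;&     # ;& runs the next body without testing it
    q)  hits+=(fell-through) ;;    # ;; stops here
    *)  hits+=(default) ;;
  esac
  echo "${hits[*]}"
}

classify az    # starts-with-a ends-with-z fell-through
classify abc   # starts-with-a default

foo=5 bar=3

# Indexed array: the subscript is arithmetic, so foo-bar is 5 - 3 = 2
indexed=(zero one two three)
echo "${indexed[foo-bar]}"    # two

# Associative array: the same subscript is a literal string key
declare -A assoc=([foo-bar]="literal key")
echo "${assoc[foo-bar]}"      # literal key
```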

Suffice to say, any formatter has its work cut out.

Battle of the Bash formatters

The de facto formatter for Bash is shfmt. Written in Go by Daniel Martí, it is actively maintained and has been around for the best part of a decade.

Let’s compare Topiary’s Bash formatting with shfmt in a contest worthy of a Netflix special. I’ll look specifically at each tool’s parsing and formatting capabilities as well as their performance characteristics. I won’t, however, compare their subjective formatting styles, as this is largely a matter of taste.

What Topiary can’t do that `shfmt` can²

When it comes to formatting Bash in a way that is commonly attested in the wild, there are three things that Topiary cannot currently do. Unfortunately, these are either from the absence of a feature in Topiary, or a lack of fidelity in the Tree-sitter grammar; no amount of hacking on queries will fix them.

The worst offender is probably the inability to distinguish line continuations from other token boundaries. These are used in Bash scripts all the time to break up long commands into more digestible code. In the following example, the call to topiary was spread over multiple lines, with line continuations. Topiary slurps everything onto a single line, whereas shfmt preserves the original line continuations in the input:

```
# Topiary
topiary format --language bash --query bash.scm <"${script}"

# shfmt
topiary format \
    --language bash \
    --query bash.scm \
    <"${script}"
```

One saving grace is that Topiary’s Bash parser understands a trailing |, in a pipeline, to accept a line break. As such — while it isn’t my personal favourite style³ — Topiary does support multi-line pipelines. Arguably, they even look a little nicer in Topiary than in shfmt, which only preserves where the line breaks occurred in the input:

```
# Topiary
foo |
  bar |
  baz |
  quux

# shfmt
foo | bar |
    baz | quux
```

Otherwise, in Topiary, every command is a one-liner…whether you like it or not!

Next on the “nice to have” list is the long-standing (and controversial) feature request of “alignment blocks”; specifically for comments. That is, presumably related comments appearing on a series of lines should be aligned to the same column:

```
# Topiary
here # comment
is # comment
a # comment
sequence # comment
of # comment
commands # comment

# shfmt
here     # comment
is       # comment
a        # comment
sequence # comment
of       # comment
commands # comment
```

The tl;dr of the controversy is that, despite being a popular request — and we all know where popularity gets us, these days — it’s a slap in the face to one of Topiary’s core design principles: minimising diffs. Because we live in a universe where elastic tabstops never really took off, a small change to the above example — say, adding an option to one of the commands — would produce the following noisy diff:

```
-here     # comment
-is       # comment
-a        # comment
-sequence # comment
-of       # comment
-commands # comment
+here                      # comment
+is                        # comment
+a                         # comment
+sequence                  # comment
+of                        # comment
+commands --with-an-option # comment
```

For the time being, Topiary won’t be making alignment great again.

Finally, string interpolations — with command substitution and arithmetic expansions — cannot be formatted without potentially breaking the string itself. This is particularly true of heredocs; the full subtleties of which escape the Tree-sitter Bash grammar and so are easily corruptible with naive formatting changes. As such, Topiary has to treat these as immutable leaves and leave them untouched:

```
# Topiary
echo "2 + 2 = $((  2+  2 ))"

cat <<EOF
Today is $(   date )
EOF

# shfmt
echo "2 + 2 = $((2 + 2))"

cat <<EOF
Today is $(date)
EOF
```

So far, I have found only three constructions that are syntactically correct but that the Tree-sitter Bash grammar cannot parse (whereas shfmt can):

A herestring that follows a file redirection (issue #282):
```
rev > output <<< hello
```
A workaround, for now, is to switch the order; so the herestring comes first.
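As a small sketch of that reordering: the two redirections target different file descriptors (standard input and standard output), so swapping them should not change what the command does. The temporary file below is a hypothetical stand-in for `output`.

```shell
#!/usr/bin/env bash

# Hypothetical temporary file standing in for "output" in the example above
output="$(mktemp)"

# Herestring first, file redirection second
rev <<< hello > "${output}"

cat "${output}"    # olleh
```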

A heredoc that uses an empty marker (issue #283):

```
cat <<''
Only a monster would do this, anyway!
```

Similar to line continuations, the Tree-sitter Bash grammar seems to swallow escaped spaces at the beginning of tokens, interpreting them as tokenisation whitespace rather than literals (issue #284):
```
# This should output:
# <a>
# <b>
# < >
# <c>
printf "<%s>\n" a b \  c
```

For what it’s worth, shfmt also supports POSIX shell and mksh (a KornShell implementation). As of writing, there are no Tree-sitter grammars for these shells. However, their syntax doesn’t diverge too far from Bash, so it’s likely that Topiary’s Bash support will be sufficient for large swathes of such scripts. Moreover, the halcyon years of the 1990s are a long way behind us, so maybe this doesn’t matter.

What `shfmt` can’t do that Topiary can²

shfmt is part of a wider project that includes a Bash parser for the Go ecosystem. A purpose-built parser, particularly for Bash, should perform better than the generalised promise of Tree-sitter and, indeed, that’s what we see. However, there are a few minor constructions that shfmt doesn’t like, but the Tree-sitter Bash grammar accepts:

An array index assignment which uses the addition augmented assignment operator:
```
my_array=(
  foo
  [0]+=bar
)
```
To be fair to shfmt, while this is valid Bash, not even the venerable ShellCheck can parse this!
Topiary leaves array indices unformatted, despite them allowing arithmetic expressions. shfmt, however, will add whitespace to any index that looks like an arithmetic expression (e.g., [foo-bar] will become [ foo - bar ]); even if the original, unspaced version could be a valid associative array key.

(Neither Topiary nor shfmt can handle indices containing spaces. However, the standard Bash workaround™ is to quote these: ${array["foo bar"]}.)
Brace expansions can appear — perhaps surprisingly — almost anywhere. Particularly surprising to shfmt is when they appear in variable declarations, which it cannot parse:
```
declare {a,b,c}=123      # a=123 b=123 c=123
declare foo{1..10}=bar   # foo1=bar foo2=bar ... foo10=bar
```

While it’s a bit of a hack,⁴ we also implement something akin to “rewrite rules” in our Topiary Bash formatting queries, which shfmt (mostly) doesn’t do. This is to enforce a canonical style over certain constructions. Namely:

- All $... variables are rewritten in their unambiguous form of ${...}, excluding special variables such as $1 and $@. (Note that this doesn’t affect $'...' ANSI C strings, despite their superficial similarity.)
- All function signatures are rewritten to the name() { ... } form, rather than function name { ... } or function name() { ... }.
- ~~All POSIX-style [ ... ] test clauses are rewritten to the Bash [[ ... ]] form.~~ (June 2025: This rewrite was reverted as it can change the semantics of your code.)
- All legacy $[ ... ] arithmetic expansions are rewritten to their $(( ... )) form.
- All `...` command substitutions are rewritten to their $( ... ) form. (This is one that shfmt does do.)
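To illustrate how the reverted `[ ... ]` to `[[ ... ]]` rewrite can change semantics, here is one well-known case; it is offered as an example only, and is not necessarily the case that prompted the revert. An unquoted empty variable is removed by word splitting inside `[ ... ]`, but not inside `[[ ... ]]`, so the two forms can disagree.

```shell
#!/usr/bin/env bash

empty=""

# With [ ... ], the unquoted empty variable vanishes after word splitting,
# leaving `[ -n ]`: a one-argument test that the string "-n" is non-empty.
if [ -n $empty ]; then single=true; else single=false; fi

# With [[ ... ]], no word splitting happens, so -n sees the empty string.
if [[ -n $empty ]]; then double=true; else double=false; fi

echo "[ -n \$empty ] is ${single}; [[ -n \$empty ]] is ${double}"
# [ -n $empty ] is true; [[ -n $empty ]] is false
```

A script that (knowingly or not) relies on the first behaviour would take a different branch after the rewrite.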

Technically, it is also possible to write rules that put quotes around unquoted command arguments, ignoring things like -o/--options. While this is good practice, we do not enforce this style as it changes the code’s semantics and there may be legitimate reasons to leave arguments unquoted.
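A minimal sketch of why quoting is not a purely cosmetic change, assuming the default `IFS`; the `count_args` helper and the option string are hypothetical:

```shell
#!/usr/bin/env bash

# count_args: print how many arguments it received (illustrative helper)
count_args() { echo "$#"; }

opts="--verbose --colour"   # hypothetical options held in a single string

count_args $opts      # 2: word splitting yields two arguments
count_args "$opts"    # 1: quoting passes the whole string as one argument
```

Scripts that build up option strings this way depend on the unquoted form, which is one legitimate reason to leave an argument unquoted.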

Throughput

Let’s be honest: If you have so much Bash to format that throughput becomes meaningful, then formatting is probably the least of your worries. That being said, it is the one metric that we can actually quantify.

Our first problem is that we need a large corpus of normal scripts. By “normal,” I mean things that you’d see in the wild and could conceivably understand if you squint hard enough. This rules out the Bash test suite, for example, which — while quite large — is a grimoire of weird edge cases that neither Topiary nor shfmt handle well. Quite frankly, if you’re writing Bash that looks like this, then you don’t deserve formatting:

```
: $(case a in a) : ;#esac ;;
esac)
```

Digging around on r/bash, I came across this repository of scripts. They’re all fairly short, but they’re quite sane. This will do.

We need to slam large amounts of Bash into the immovable objects that are our formatters; a “Bash test dummy,”⁵ if you will. It would be ideal if we could stream Bash into our formatters — so we could orchestrate sampling at regular time intervals — however, neither Topiary nor shfmt support streaming formatting. This stands to reason as there are cases where formatting will depend on some future context, so the whole input will need to be read upfront. As such, we need to invert our approach to collecting metrics and sample over input size instead.

The general method is:

Locate the scripts in the repository that are Bash, by looking at their shebang.
Filter this list to those which Topiary can handle without tripping over itself because of some obscure parsing failure. (We assume shfmt doesn’t require such a concession.)
Perform $N$ trials, in which:
- The whitelist of scripts is randomised, to remove any potential confounding from caching.
- The top $M$ scripts are concatenated to obtain a single trial input.⁶ This is to increase the input size to the formatters in each trial, which is presumed to be the independent variable, but may be subject to confounding effects when the input is small.
- The trial input is read to /dev/null a handful of times to warm up the filesystem cache.
- The trial input is fed into the following, with benchmarks — trial input size (bytes) and runtime (nanoseconds) — recorded for each:
  - cat, which acts as a control;
  - Topiary (v0.5.1; release build, with the query changes described in this blog post);
  - Topiary, with its idempotence checking disabled;
  - shfmt (v3.10.0).
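The method above could be sketched roughly as follows. This is not the script that produced the results in this post: the function names are invented, it assumes GNU coreutils (for `shuf` and `date +%s%N`), and it leaves out the step that filters the list down to scripts Topiary can parse.

```shell
#!/usr/bin/env bash
# Sketch of a single trial; assumes GNU coreutils (shuf, date +%s%N).

# find_bash_scripts DIR: list files whose first line is a Bash shebang
find_bash_scripts() {
  local dir="$1" script first_line shebang_re='^#!.*bash'
  while IFS= read -r -d '' script; do
    first_line="$(head -n 1 "${script}" 2>/dev/null | tr -d '\0')"
    if [[ ${first_line} =~ ${shebang_re} ]]; then
      echo "${script}"
    fi
  done < <(find "${dir}" -type f -print0)
}

# make_trial_input M: concatenate M randomly chosen scripts named on stdin
make_trial_input() {
  local script
  shuf -n "$1" | while IFS= read -r script; do
    cat "${script}"
  done
}

# benchmark FILE CMD [ARGS...]: print "<input bytes> <runtime in ns>"
benchmark() {
  local input="$1" bytes start end
  shift
  bytes="$(wc -c < "${input}")"
  start="$(date +%s%N)"
  "$@" < "${input}" > /dev/null
  end="$(date +%s%N)"
  echo "$(( bytes )) $(( end - start ))"
}

# Example trial (the path and formatter invocations are illustrative):
#   input="$(mktemp)"
#   find_bash_scripts ./scripts | make_trial_input 25 > "${input}"
#   for _ in 1 2 3; do cat "${input}" > /dev/null; done   # warm the cache
#   benchmark "${input}" cat
#   benchmark "${input}" topiary format --language bash
#   benchmark "${input}" shfmt
```

Timing with `date` includes the cost of spawning `date` itself, so for very small inputs a dedicated benchmarking tool would give cleaner numbers.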

This identified 156 Bash scripts within the test repository, of which 154 could be handled by Topiary.⁷ On an 11th generation Intel Core i7, at normal stepping, with $N=50$ and $M=25$, on a Tuesday afternoon, I obtained the following results:

cat, which does nothing, is unsurprisingly way out in front; by two orders of magnitude. This is not interesting, but establishes that input can be read faster than it can be formatted. That is, our little experiment is not accidentally I/O bound.

What is interesting is that Topiary is about 3× faster than shfmt. We also see that the penalty imposed by idempotency checking — which formats twice, to check the output reaches a fixed point — is quite negligible. This indicates that most of the work Topiary is doing is in its startup overhead, which involves loading the grammar and parsing the formatting query file.
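The fixed-point idea can be sketched from the outside, for any formatter that reads standard input and writes standard output. This is only an approximation of the concept, not Topiary's internal implementation, and the `reaches_fixed_point` name is made up:

```shell
#!/usr/bin/env bash

# reaches_fixed_point FILE CMD [ARGS...]
# Succeeds if CMD's output is unchanged when fed back through CMD.
reaches_fixed_point() {
  local input="$1" once twice
  shift
  once="$("$@" < "${input}")" || return 2
  twice="$("$@" <<< "${once}")" || return 2
  [[ "${once}" == "${twice}" ]]
}
```

For example, `reaches_fixed_point script.sh shfmt` (with a hypothetical `script.sh`) would format the script, format the result again, and compare the two outputs. Unlike a built-in check, this pays the formatter's startup cost twice.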

Since Topiary only has to do this once per trial, it’s a little unfair to set $M=25$; that is, an artificially enlarged input that is syntactically valid but semantically meaningless. However, if we set $M=1$ (i.e., individual scripts), then we see a similar comparison:

For small inputs, the idempotency check penalty is barely perceptible. Otherwise, the startup overhead dominates for both formatters — hence the much lower throughput values — but, still, Topiary comfortably outperforms shfmt by a similar factor.

And the winner is…

In an attempt to regain some professional integrity, I’ll fess up to the fact that Topiary has a bit of a home advantage and maybe — just maybe — I’m ever so slightly biased. That is, as we are in the (dubious) position of building a plane while attempting to fly it, I was able to tweak and fix a few of our formatting rules to improve Topiary’s Bash support during the writing of this blog post:

- I added formatting rules for arrays (and associative arrays) and their elements.
- I corrected the formatting of trailing comments that appear at the end of a script.
- I corrected the function signature rewriting rule.
- I corrected the formatting of a string of commands that are interposed by Bash’s & asynchronous operator.
- I fixed the formatting of test commands.
- I implemented multi-line support for pipelines.⁸
- I updated the $... variable rewrite rule to avoid targeting special forms like $0, $? and $@, etc.
- I implemented a rewrite rule that converts legacy $[ ... ] arithmetic expansions into their $(( ... )) form.
- I implemented a rewrite rule that converts `...` command substitutions into their $(...) form.
- I fixed the spacing within variable declarations, to accommodate arguments and expansions.
- I forced additional spacing in command substitutions containing subshells, to remove any ambiguity with arithmetic expansions.

The point I’m making here is that these adjustments were very easy to conjure up; just a few minutes of thought for each, across our Tree-sitter queries, was required.

So who’s the winner?

Well, would it be terribly anticlimactic of me, after all that, not to call it? shfmt is certainly more resilient to Bash-weirdness and, of the “big three” I discussed, its line continuation handling is a must have. However, Topiary does pretty well, regardless: It’s much faster, for what that’s worth, and — more to the point — far easier to tweak and hack on.

Indeed, when the Topiary team first embarked upon this path, we weren’t even sure whether it would be possible to format Bash. Now that the Tree-sitter Bash grammar has matured, Topiary — perhaps with future fixes to address some of its shortcomings, uncovered by this blog post — is a contender in the Bash ecosystem.

Thanks to Nicolas Bacquey, Yann Hamdaoui, Tor Hovland, Torsten Schmits and Arnaud Spiwack for their reviews and input on this post, and to Florent Chevrou for his assistance with the side-by-side code styling.

¹ It’s very likely that the syntax highlighting for the more exotic Bash snippets in this blog post will be completely broken.

² …Yet.

³ My preferred multi-line pipeline style is to have a line continuation and then the | character on the next line, indented:
```
foo \
  | bar \
  | baz \
  | quux
```
I personally find this much clearer, but Topiary cannot currently handle those pesky line continuations. For shame!

⁴ Topiary’s formatting rules include node deletion and delimiter insertion. However, delimiters can be any string, so we can coopt this functionality to create basic rewrite rules.

⁵ I’m also the “terrible pun guy.”

⁶ This exposed an unexpected bug, whereby Topiary’s formatting model breaks down when some complexity (or, by proxy, size) limit is reached. This behaviour had not been previously observed and further investigation is required.

⁷ The two failures were due to the aforementioned herestring and complexity⁶ problems.

⁸ It may also be possible to implement multi-line && and || lists in a similar way. However, the Tree-sitter grammar parses these into a left-associative nested (list) structure, which is tricky to query.

Behind the scenes

Christopher Harrison

Chris is a principal software engineer (mainly Python and Rust) and editor-in-chief of the Open Source Programme Office technical blog; he is also the project steward of Topiary, a universal code formatting engine. He has spent much of his career working for academia, from both the business and the research sides of the industry. He particularly enjoys writing well-tested, maintainable code that serves a pragmatic end, with a side-helping of DevOps to keep services ticking with minimal fuss.


This article is licensed under a Creative Commons Attribution 4.0 International license.
