The shell is at the heart of Unix. It's the glue that makes all the little Unix tools work together so well. Understanding it sheds light on many of Unix's important ideas, and writing our own is the best path to that understanding.
Earlier this year, at a place I worked, I decided to run a series of workshops on writing a Unix shell. A lot of questions had come up whose answers I think writing a shell walks you through, as well as issues that suggested tenuous mental models of the shell and its scripting language.
A small sampling of those kinds of questions:
- why does this Python replacement for this shell script deadlock?
- why are there so many Unicorn processes on this server?
- when are variables quoted in a shell script? why aren't they always visible to other programs I run?
- how does control-C (^C) work? why do I need to use ^\ sometimes instead? why doesn't ^C work the way I'd expect in this bash script?
- what's special about init?
- why do we do these steps in daemonization?
At this company, we had a regular Friday afternoon workshop/lecture series. I had previously tried to do an overview of Unix process relationships, but it felt too abstract. So, I tried to make it more concrete by getting everyone to actually implement a shell.
Initially, this was just a rough layout of what I thought I could cover in each session, and pointers to manpages. I never turned this into the full, DIY, self-paced tutorial I had hoped, but (in the spirit of release early, release often) I am opening up my work in progress at https://github.com/tokenrove/build-your-own-shell.
This isn't "finished", but if you're ambitious, you should be able to make something that passes all the tests. I decided it would be better to put it out there, even in rough form, than keep it sealed up. After all, a number of people enjoyed the workshop out of which this came.
(Caveat for macOS and *BSD users: there's still something wonky about the timing in the section that tests signals and job control; hopefully by the time you're reading this, I'll have it worked out, but if not, I apologize.)
In this post I'll reflect on some choices I made, and follow a few tangents that come up in the text but would be disruptive there.
I decided that, for this to be useful for self-study, it should contain an automated test suite. I love Tcl and expect, and had figured it would be a natural tool for testing the interactive components of shells. I took a quick look at how other shells were testing themselves. Most were strictly non-interactive tests, using shell scripts and comparing with expected output. A nice exception here is fish, which indeed uses expect for its interactive tests.
This makes sense, but I wanted to focus on interactive shells: in part because so many tutorials ignored the considerations of interactive shells, but also because I felt people would enjoy themselves more if they could use the shell they were writing directly.
I started with some tests edited from the output of autoexpect, but
this turned out to be too fragile. Something I noticed in the first
workshop was that people really enjoyed customizing their prompt; this
should be no surprise (prompt customization is a perennial
time-wasting activity in any shell), but it meant I'd have to be
careful about how I matched outputs in tests. In particular, I
couldn't really depend on detecting and matching the prompt.
The other tricky thing is that I couldn't use any feature in the tests
that hadn't been developed yet, so using conditionals or echoing
wasn't possible in the early tests.
I considered writing a wrapper using ptrace(2) that would watch for
wait syscalls from the shell and its children, and print those in a
form easily consumed by a test harness (this seemed easier to do than
cleaning up the output of existing tracing tools). However, things
like prompts that exec git every time, as well as ptrace's noted
stubbiness on macOS, prevented me from going further with this.
So that's where the workshop sat for a long time, until I finally decided to use a little test description language in place of expect scripts directly. So now a typical test might look like:
→ true || false || echo-rot13 foo⏎
≠ sbb
→ false || true && echo-rot13 foo⏎
← sbb
→ exit 42
☠ 42
For whatever reason, expressing things this way allowed me to finally write out all the tests I had intended to have, without focusing too much on the implementation of the test harness. Then I wrote some Tcl to interpret these files.
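To give a flavor of how little machinery such a format needs, here is a hypothetical parser sketch (in Python rather than the Tcl I actually used, and with structure I am inventing for illustration): "→" sends a line of input, with "⏎" marking Enter; "←" expects a string in the output; "≠" asserts it must not appear; and "☠" gives the expected exit status.

```python
# Hypothetical parser for the little test description language.
# The marker-to-operation mapping is an assumption for this sketch.

def parse_test(text):
    """Turn a test description into a list of (op, argument) steps."""
    ops = {"→": "send", "←": "expect", "≠": "reject", "☠": "exit"}
    steps = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        marker, _, rest = line.partition(" ")
        steps.append((ops[marker], rest.replace("⏎", "\n")))
    return steps

steps = parse_test("→ true || false || echo-rot13 foo⏎\n≠ sbb\n→ exit 42\n☠ 42")
print(steps)
```

The harness then feeds "send" steps to the shell under test and checks the others against its output and exit status.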
I decided to go with string matching in the output, which is not
particularly robust, but is simple. Because of discrepancies between
how different shells and TTY drivers draw things, it can be prone to
matching the echoed input as the output if one isn't careful. There
are also some timing issues; the script written by autoexpect
suggests inserting a 100ms delay between each keystroke sent, but this
makes the tests cripplingly slow; I'm still trying to find a tuning
that is reliable across systems but speedy enough to be usable.
I decided that mksh should pass all the tests, and cat should fail
every test. There's nothing worse than a test that fails to actually
test something. This reminds me of the admonishment "don't try to do
what a corpse can do better": goals phrased in the negative (like
"stop reading Hacker News") are hard to achieve; the corpse (or the
cat) will always do them better than you. Positive goals (and tests)
are more actionable.
There are still some timing issues on different platforms, but I don't regret making the simple choice for now.
Minimal shell builtins
Doing the workshop led me to think about minimizing shell builtins;
one of the questions that comes up a lot is why
cd needs to be a
builtin, but what doesn't come up until one is much deeper into
pipelines and job control is what a pain builtins are, in how they
interact with the rest of the shell's features. It would be nice to
get rid of them.
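A quick demonstration (in Python for brevity) of why cd has to be a builtin: an external command runs in a child process, and a child's chdir(2) doesn't affect its parent.

```python
import os

# If "cd" were an external command, it would run in a child process;
# chdir(2) there changes only the child's working directory, and the
# parent (the shell) is untouched once the child exits.

before = os.getcwd()
pid = os.fork()
if pid == 0:
    os.chdir("/")        # the "external cd" does its work...
    os._exit(0)          # ...and exits, taking its cwd with it
os.waitpid(pid, 0)
print(os.getcwd() == before)   # True: the parent's cwd is unchanged
```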
There are some commands which are builtins only to make them fast,
like true and false. These usually have external equivalents in /bin.
Some builtins are required because they modify the shell's own state,
like cd, umask, and
ulimit. (This is excluding really tricky, impractical things, like
using shared memory, or
ptrace to modify the
shell from an outside process.)1
To prove a point, you could take functional programming to an extreme
and have an immutable shell where
cd executes a new shell in the
chosen directory, but some of the others are probably not possible in
the presence of typical job control.2
If we take this line of thought further, we can try externalizing some
of the shell's operators. Conditional execution is interesting. How
would we externalize && and ||? Syntactically, we probably can't pull
these off as external commands, but we could provide and and or
commands which take commands to execute.
if is an obvious next step from and and
or. Now we
can build while, although we'd have to be careful about how we
handle the environment if we wanted to handle many typical uses of it.
The for loop almost already exists in this form, as xargs; we
would probably want to provide both a sequential version, where the
environment for each iteration depends on the previous, and a parallel
version where everything can run at the same time.
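As a toy illustration of the idea (my own sketch, not execline's actual interface), an external or could take two commands, run the first, and run the second only if the first fails:

```python
import subprocess

# A hypothetical external "or": run the first command; only if it
# exits nonzero, run the second.  An external "and" would be the
# mirror image.  (A real chain-loading version would exec rather
# than fork, in the execline style.)
def or_command(first, second):
    if subprocess.run(first).returncode == 0:
        return 0
    return subprocess.run(second).returncode

# false || echo fallback  ->  runs echo
status = or_command(["false"], ["echo", "fallback"])
print(status)   # 0: the second command succeeded
```

The hard part, as the next paragraph notes, is quoting: each argument here is already a pre-split argument vector, which the shell's syntax would have to deliver intact.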
Note that most of these approaches require that you have mechanisms for escaping that aren't too cumbersome, for them to be practical. There seems to be a close parallel with macro facilities in languages like Lisp.
At the extreme side of cumbersome quoting would be
case, which would
probably want to take its input from a heredoc.
I was originally going to write a proof of concept of this (called "builtouts"), but researching this led me to the intriguing execline "shell", which has already done this, and explored this space rather nicely.
One thing that
execline doesn't seem to do is implement something
resembling real job control. If
bg executes a command without
waiting and then re-executes the shell with a suitable variable set
(to the PGID of this job), the shell on each execution can check this
variable to see what jobs are still alive; the
jobs command can
print the contents of this variable; the
fg command just becomes
wait with the PGID of the current job. For an
interactive shell, the tricky thing is probably making sure that
bg's children don't end up in an orphaned process group.
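The mechanics of that bg/fg idea can be sketched directly (a simplified, single-job sketch of my own, with no terminal handover): bg starts a command in its own process group and records the PGID; fg is then just a wait on that group.

```python
import os

# "bg": start a command in its own process group, without waiting.
def bg(argv):
    pid = os.fork()
    if pid == 0:
        os.setpgid(0, 0)          # put the job in its own process group
        os.execvp(argv[0], argv)
    try:
        os.setpgid(pid, pid)      # also set it in the parent, to avoid a race
    except (PermissionError, ProcessLookupError):
        pass                      # the child already set it and exec'd/exited
    return pid                    # the PGID is the leader's PID

# "fg": wait for any process in the job's process group.
def fg(pgid):
    _, status = os.waitpid(-pgid, 0)
    return os.waitstatus_to_exitcode(status)

job = bg(["sleep", "0.1"])
print(fg(job))   # 0 once the job finishes
```

In the re-exec scheme described above, bg would record the PGID in an environment variable instead of a Python value, and jobs would just print it.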
A lot of these programs end up having to deal with quoting. Is there
a way to take this further and handle quoting in its own program? For
fixed-arity programs (like
if), we can imagine a quoting wrapper
that calls a subsidiary program with, first, the fixed
arguments, and then all of the original quoted argument, expanded, as
the remaining arguments.
Long ago, in UNIX V6, there was a program /etc/glob that would expand wildcard patterns. Soon afterward this became a shell built-in.
Luckily, the source is available in Diomidis Spinellis's unix-history-repo, and we can see that it does this same kind of chain loading, executing its first argument with the rest of its arguments expanded according to the globbing rules.
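The same trick is easy to sketch today (a toy reimplementation of mine, not V6's actual code): expand each argument as a pattern, then exec the command named by the first argument with the expanded rest.

```python
import glob, os, sys

# A toy /etc/glob in the V6 style: expand each argument, passing it
# through unchanged if nothing matches (as the shell does today).

def expand(args):
    out = []
    for arg in args:
        matches = sorted(glob.glob(arg))
        out.extend(matches if matches else [arg])  # no match: pass through
    return out

print(expand(["/et?/hosts"]))      # e.g. ['/etc/hosts']

# The chain loading itself is one line:
#   os.execvp(sys.argv[1], [sys.argv[1]] + expand(sys.argv[2:]))
```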
Objet trouvé engineering
(Found object engineering, often called cargo cult programming.)
Now we get to the inflammatory bits, for those who kept reading.
Stack Overflow modernized, but did not create, the practice of assembling Frankenstein programs from poorly understood and imitated examples, but I think no language has been more greatly affected by this than shell, as evidenced by the bizarre ready-made shell scripts one can encounter almost everywhere. Sometimes, the evolution of these patterns reminds me of semantic drift in natural languages.
A lot of constructs are poorly understood and misused. I'm not blaming people, though; part of the problem is that I can't easily point to a single, modern reference work that someone should read before writing shell scripts. And, since shell scripts often feel like "configuration" rather than "programming", I imagine people don't even think about learning shell as a programming language.
Writing a shell helps disabuse people of some common confusions, for example that:
- bash is shell scripting, definitively, and if you write a shell script, it is a "bash script" ("all the world's a VAX" and all that);
- quotes make something into a string;
- double and single quotes are interchangeable;
- the argument to if, while, et cetera is something magical;
- running export FOO=x repeatedly does something;
- variable case has some magic properties;
- et cetera.
Why write a shell
In the workshop, I cite the following motivations for writing a shell:
- to give you a better understanding of how Unix processes work;
- this will make you better at designing and understanding software that runs on Unix;
- to clarify some common misunderstandings of POSIX shells;
- this will make you more effective at using and scripting ubiquitous shells like bash;
- to help you build a working implementation of a shell you can be
excited about working on.
- there are endless personal customizations you can make to your own shell, and these can help you think about how you interact with your computer and how it might be different.
I've already touched on the first two, but the third is maybe less obvious. The shell remains a ubiquitous interface, decades after we imagined other modes of interaction would replace it. The field is ripe with opportunities for improvements.
There are a lot of people exploring this space in interesting ways, but I think there's room for so much more.
A lot of existing tutorials focus on the non-interactive case, and I think people will have more fun if they build a shell they can use interactively.
Aside from the interactive case, a lot of infrastructure is held together with shell scripts.
There's a commonly held belief that scripting languages like Perl, Ruby, and Python are complete replacements for shell scripting. My own experience is that these languages lack the expressive tools of the shell for working with pipelines, exit statuses, redirections, and so on, and the replacement code is often:
- sequential rather than parallel, and often much slower for this reason;
- full of deadlocks, race conditions, and signal handling issues;
- much more verbose and less clear than an equivalently carefully-written shell script.
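The deadlock question from the beginning of this post is usually an instance of the second point, and it comes down to something like this minimal sketch:

```python
import subprocess

# The classic way a straight-line "Python replacement" deadlocks: the
# child fills the pipe's buffer (about 64 KiB on Linux), blocks in
# write(2), and the parent blocks forever in wait(2), never reading.
#
#   p = subprocess.Popen(["seq", "1", "100000"], stdout=subprocess.PIPE)
#   p.wait()                      # deadlock: nobody drains p.stdout
#
# communicate() reads and waits concurrently, which is what the
# shell's pipeline machinery gives you for free:
p = subprocess.Popen(["seq", "1", "100000"], stdout=subprocess.PIPE)
out, _ = p.communicate()
print(len(out.splitlines()))   # 100000
```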
So, I feel there's still room for new tools in this space, too.
If you've been thinking about writing a shell for a while and haven't gotten around to it, why not try my workshop or any of the tutorials it links to?
POSIX avoids dictating exactly what must be a builtin, but does specify that the following commands must be found regardless of whether they are present in PATH:
alias bg cd command false fc fg getopts jobs kill newgrp pwd read true umask unalias wait
Most of these have something to do with the shell's internal state, but not all.