Julian Squires

Bug story: getaddrinfo(3) and PBR

2024-01-01T03:30:00+0000

A while ago I was working on wireless access points (APs) based on OpenWrt. One day I discovered that remote logging wasn't working, and the debugging that followed had some surprises.

On OpenWrt, there's a process called logread responsible for shipping the logs to another device via the syslog protocol. These APs don't persist their logs between boots, so sending logs to a system that can store them was essential for diagnosing problems. I noticed logread wasn't running, though it starts on boot, so I added something to the init script to restart logread if it crashed, and was going to call it a day. But I went to test it, and the logs weren't showing up; sometimes, the logs would show up right after the AP booted, but then at some point, it would stop working.

I had already spent a lot of time on the other side of this, the syslog that receives the logs, and was pretty sure the setup was correct there. So I ran logread by hand, and it failed with

failed to connect: Permission denied

What? Permission denied? I read the code to find out where this was happening, and it was in usock(), which is some socket code that's used all over OpenWrt, and there were no obvious calls that could fail with EACCES in it.

After checking some ACLs, making sure this couldn't possibly be a permission problem (it's running as root), I decided to strace logread (this required rebuilding the flash image for the AP, which is why I didn't do it earlier), and saw:

socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC, IPPROTO_UDP) = 8
connect(8, {sa_family=AF_INET, sin_port=htons(65535), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
close(8)                                = 0
socket(AF_INET6, SOCK_DGRAM|SOCK_CLOEXEC, IPPROTO_UDP) = 8
connect(8, {sa_family=AF_INET6, sin6_port=htons(65535), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = -1 EACCES (Permission denied)
close(8)                                = 0

What the heck? First off, the connection logread is trying to make in this case is a TCP connection, and we're giving it an IP address; why is it making UDP connections to localhost? And why are those connections failing?

I had a guess on why this started happening – a little while before, IPv6 had been disabled on these devices. Maybe it hadn't been done thoroughly enough? I checked ip addr, and lo definitely did not have ::1 as an address, and IPv6 was disabled through the disable_ipv6 sysctl.

I decided that it was probably a call to getaddrinfo() making UDP connections – maybe it's trying to resolve DNS – but why port 65535? Is that just an ephemeral port it's choosing every single time?

I tested getaddrinfo from Lua (the only interpreted language on the device), but it worked fine, so there had to be something about how usock was calling it; did it want IPv6 addresses specifically or something?

musl is the libc of choice on these devices. Checking its implementation of getaddrinfo, we see this block of code near the top:

if (flags & AI_ADDRCONFIG) {
    /* Define the "an address is configured" condition for address
     * families via ability to create a socket for the family plus
     * routability of the loopback address for the family. */
    // …
    static const struct sockaddr_in6 lo6 = {
        .sin6_family = AF_INET6, .sin6_port = 65535,
        .sin6_addr = IN6ADDR_LOOPBACK_INIT
    };
    const void *ta[2] = { &lo4, &lo6 };
    // …
    for (i=0; i<2; i++) {
      // …
      int s = socket(tf[i], SOCK_CLOEXEC|SOCK_DGRAM,
                     IPPROTO_UDP);
      if (s>=0) {
        int cs;
        pthread_setcancelstate(
                               PTHREAD_CANCEL_DISABLE, &cs);
        int r = connect(s, ta[i], tl[i]);
        pthread_setcancelstate(cs, 0);
        close(s);
        if (!r) continue;
      }
      switch (errno) {
      case EADDRNOTAVAIL:
      case EAFNOSUPPORT:
      case EHOSTUNREACH:
      case ENETDOWN:
      case ENETUNREACH:
        break;
      default:
        return EAI_SYSTEM;
      }
      // …

And sure enough, usock always sets AI_ADDRCONFIG on flags. So this is a kind of probing connect musl is using to check the validity of IPv4 or IPv6 on the system. The connect is returning EACCES, but musl isn't handling it as part of the errors it considers "normal". It bails out early, and leaves errno set to EACCES where logread prints it out to mystify us.
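To make the probing behavior concrete, here is a minimal sketch of the same idea outside of musl. It is not musl's code: the function names probe_loopback and probe_errno_means_unconfigured are made up for illustration, and the errno list is simply copied from the switch in the excerpt above.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: the errno values that the excerpt above treats as
 * "this address family just isn't configured". Anything else makes
 * getaddrinfo give up with EAI_SYSTEM. */
int probe_errno_means_unconfigured(int err)
{
    switch (err) {
    case EADDRNOTAVAIL:
    case EAFNOSUPPORT:
    case EHOSTUNREACH:
    case ENETDOWN:
    case ENETUNREACH:
        return 1;
    default:
        return 0;
    }
}

/* Hypothetical helper: connect a UDP socket to the loopback address of
 * the given family on port 65535. No packet is sent; connect() on a
 * datagram socket only asks the kernel for a route. Returns 0 on
 * success, or the errno value of the failing call. */
int probe_loopback(int family)
{
    struct sockaddr_in lo4 = {
        .sin_family = AF_INET,
        .sin_port = htons(65535),
        .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
    };
    struct sockaddr_in6 lo6 = {
        .sin6_family = AF_INET6,
        .sin6_port = htons(65535),
        .sin6_addr = IN6ADDR_LOOPBACK_INIT,
    };
    assert(family == AF_INET || family == AF_INET6);

    const struct sockaddr *addr = (family == AF_INET)
        ? (const struct sockaddr *)&lo4
        : (const struct sockaddr *)&lo6;
    socklen_t len = (family == AF_INET) ? sizeof lo4 : sizeof lo6;

    int s = socket(family, SOCK_DGRAM, IPPROTO_UDP);
    if (s < 0)
        return errno;
    /* Capture errno before close() has a chance to overwrite it. */
    int err = connect(s, addr, len) ? errno : 0;
    close(s);
    return err;
}
```

On the AP in this story, probe_loopback(AF_INET6) would presumably have returned EACCES, and probe_errno_means_unconfigured(EACCES) is 0, which corresponds to the early EAI_SYSTEM return.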

But why would connect fail with EACCES? The man page doesn't list anything that makes sense for this.¹ Weirder still, I decide to check if there are any IPv6 addresses at all – and there is one, but for eth0, not lo. I delete it, and suddenly logread works.

At this point I start looking for information about musl's getaddrinfo and this issue, and find a patch posted to the mailing list with no replies, never applied.

Sweet! I head over to #musl on IRC and ask them if it wasn't applied for a reason, and they say it must have been overlooked. But then someone tries to reproduce with the instructions in the patch, and can't.

I dive into the kernel source trying to figure out what actually returns EACCES here. There is a lot of code under ip6_datagram_connect so I tried to grep and pray, but there were still too many possibilities to know for sure. This is an opportunity to use ftrace! I had to rebuild the kernel, since these are stripped down images for embedded devices, and I was worried trace-cmd might actually crash the device, but I got a capture fine. I could see clearly that the last useful function called under ip6_datagram_connect was fib6_rule_action, which can return EACCES, but why? What even are these rules?

I spent a while even trying to figure out what these rules are and how to manipulate them. It turns out they're for "policy-based routing" (PBR), which I hadn't really explored before. I didn't even realize some of these firewall-like policies could be handled at this level.

I was running the ip rule command and not seeing anything interesting, until I finally read the source for ip and noticed it defaults to IPv4 – I needed to run ip -6 rule, but that flag is in the ip(8) manpage, not the ip-rule(8) manpage for the subcommand. But running it on the AP, I saw:

# ip -6 rule
0:      from all lookup local
32766:  from all lookup main
4200000001:     from all iif lo lookup unspec 12
4200000002:     from all iif eth0 lookup unspec 12
4200000003:     from all iif eth1 lookup unspec 12

I'm not sure I fully understand these rules now, and it took a bit of looking (and strace to confirm the netlink message being sent) to see that this is action "12", which isn't one of the actions in the (mainline) kernel. But it was enough info to demonstrate that the issue could be reproduced on any Linux system with

ip -6 rule add from all iif lo lookup unspec prohibit

Some discussion on the musl mailing list revealed that action 12 is a special rule OpenWrt adds in their kernel. I discovered that netifd, which manages interfaces and rules on OpenWrt, was setting IPv6 policies like this, even when IPv6 was disabled, so I patched that out. And finally remote logging worked again.

This was a surprising set of interactions. Figuring it out was tractable thanks to having all the source for everything, and reasonable tools for introspection. Is there a moral to this story? Perhaps a few tidbits: strace and ftrace are good; getaddrinfo is bad; maybe don't disable IPv6; and blessed are those who update manpages.

Footnotes:

¹

At the time this happened, the man pages on my system didn't list this, but looking now, I see Stefan Puiu added a note about this, debugging much the same situation as mine. What a time saver this would have been.

the perils of pause(2)

2023-11-30T03:30:00+0000

I recently had a bug in a simple program that has a form I've seen a lot in the last few years: loops and signal handling without masking. The worst thing about these kinds of bugs is that they don't rear their heads immediately – they fall into the class of "huh, it's blocked in a syscall and I'm sure it should have woken up" bugs. Let's look at the problem and then how to lint it.

1. A common mistake

I had some tooling for a test suite that would wait for a specified signal, and then print the name of that signal on stdout. I did this by setting up a signal handler, and then calling pause(), which suspends the program until a signal is delivered (i.e., it only ever returns -1, with errno set to EINTR). The program indicates to its cooperating programs that it's ready for the signal by printing "ok".

static sig_atomic_t got;
static void h(int n) { got = n; }
//...
    if (sigaction(sig, &(struct sigaction){.sa_handler=h}, NULL))
        abort();
    write(1, "ok\n", 3);
    do { pause(); } while (sig != got);

Every so often, this program would just hang and the test would time out. Worse yet, it's rare enough that I didn't really notice it when I wrote the code.¹ (Why loop when we only expect one signal? There are other signals that will interrupt pause unless you've gone out of your way to ignore them all; for example, SIGTSTP.)

One possible problematic execution is this:

we print ok;
before we get to pause, the other program sends the signal;
now sig == got, but we pause anyway, and wait for another signal that will never come.

Another common execution with this pattern is this:

we pause, and get interrupted by some other signal;
we test got against our desired sig and see it hasn't triggered;
now our desired signal is delivered, and sig == got, but we're already past the test;
we pause again, and have to wait arbitrarily long (till some other signal wakes us up).

This also happens often in loops with poll or select:

for (;;) {
    // A) either there's fd activity, or we get EINTR
    int n_active = poll(fds, n_fds, INFTIM);
    // [...] handle fds
    // B) check variables set by signal handler
    ...
}

We expect poll to get interrupted if there's a signal; however, the signal may arrive after the test at B but before we get back to A.

The solution in all these cases is signal masks, and calls that manipulate them atomically. When a signal arrives while masked by the process, it remains pending until the process unmasks it.

2. Masking versus disposition

Something that's always confusing about this is that masking a signal does not affect its disposition. "Signal disposition" is the action associated with a signal, that is, what should happen when the signal is delivered to the process: either a handler is called, the signal is ignored, or a default action takes place.

Knowing that, you might set a mask for some signal whose disposition is SIG_DFL, and see that it works fine, and then be confused when this doesn't work for signals whose disposition is SIG_IGN. POSIX says:

If the action associated with a blocked signal is anything other than to ignore the signal, and if that signal is generated for the thread, the signal shall remain pending until it is unblocked, it is accepted when it is selected and returned by a call to the sigwait() function, or the action associated with it is set to ignore the signal. — POSIX.1-2017 System Interfaces 2.4.1 Signal generation and delivery

I noticed that OpenBSD has the slightly strange behavior of treating signals whose default disposition is to stop or continue the program as if they have the ignored disposition, so these signals need an explicit handler. This is probably a bug? Although it means we could have avoided looping in our initial example, so one could argue it's the better behavior.

Also, signal disposition is per-process, while signal masks are per-thread – but let's not get into that mess here. (An even bigger mess is then what happens on exec() – dispositions are reset, but masks are inherited, as well as pending signals!)

3. Alternatives that atomically unmask

In the case of pause(2), we can use sigsuspend(2), or a few other similar functions. If we need the signal handler, our code might look like this:

if (sigaction(sig, &(struct sigaction){.sa_handler=h}, NULL))
    abort();
sigset_t set, prev;
sigemptyset(&set);
sigaddset(&set, sig);
sigprocmask(SIG_BLOCK, &set, &prev);
write(1, "ok\n", 3);
do { sigsuspend(&prev); } while (sig != got);

but in this case, we are just waiting for this signal, and don't need to take other action when it arrives, so sigwait(2) suffices:

sigset_t set, prev;
sigemptyset(&set);
sigaddset(&set, sig);
sigprocmask(SIG_BLOCK, &set, &prev);
write(1, "ok\n", 3);
int got;
do { sigwait(&set, &got); } while (sig != got);

We might use sigwaitinfo or sigtimedwait if we need more details than just the signal number. If we were certain no other signal could arrive, we could avoid the loop entirely, but it's nice to protect against cases you might not otherwise consider, like testing the program interactively and hitting ^Z (sending SIGTSTP).

(Note that OpenBSD also lacks sigwaitinfo and sigtimedwait presently.)

For poll(2) and select(2), unmasking variants ppoll(2) and pselect(2) exist for this reason. (Linux also has signalfd(2), which more naturally integrates with polling loops, but note it only reads pending signals, so you still need to mask with sigprocmask, and now you have to deal with reading siginfo out of a buffer. Oh, and what you actually get out of the fd depends on which process is reading…) There's also the classic self-pipe trick.
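Here is a rough sketch of what one iteration of such a loop might look like with pselect. The names block_and_handle, wait_once, and the wait_result values are invented for this example, and it watches at most one descriptor to keep it short.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <time.h>

enum wait_result {
    WAIT_ERROR = -1, WAIT_TIMEOUT, WAIT_SIGNAL, WAIT_FD_READY, WAIT_OTHER_SIGNAL
};

static volatile sig_atomic_t got;
static void on_signal(int n) { got = n; }

/* Install the handler and block sig. On return, *unblocked holds the
 * previous mask with sig removed, to be used only during the wait. */
int block_and_handle(int sig, sigset_t *unblocked)
{
    struct sigaction sa = { .sa_handler = on_signal };
    sigset_t block;

    sigemptyset(&sa.sa_mask);
    if (sigaction(sig, &sa, NULL))
        return -1;
    sigemptyset(&block);
    sigaddset(&block, sig);
    if (sigprocmask(SIG_BLOCK, &block, unblocked))
        return -1;
    return sigdelset(unblocked, sig);
}

/* One loop iteration; pass fd = -1 to wait for the signal alone. */
enum wait_result wait_once(int sig, int fd, const struct timespec *timeout,
                           const sigset_t *unblocked)
{
    fd_set readable;

    assert(fd < FD_SETSIZE);
    FD_ZERO(&readable);
    if (fd >= 0)
        FD_SET(fd, &readable);

    /* sig is blocked here, so it cannot slip in between this test and
     * the wait: pselect swaps the mask and sleeps as one atomic step. */
    if (got == sig)
        return WAIT_SIGNAL;
    int n = pselect(fd >= 0 ? fd + 1 : 0, &readable, NULL, NULL,
                    timeout, unblocked);
    if (got == sig)
        return WAIT_SIGNAL;
    if (n > 0)
        return WAIT_FD_READY;
    if (n == 0)
        return WAIT_TIMEOUT;
    return errno == EINTR ? WAIT_OTHER_SIGNAL : WAIT_ERROR;
}
```

If the signal is raised after block_and_handle but before wait_once, it stays pending and is delivered as soon as pselect installs the unblocked mask, so the call returns WAIT_SIGNAL instead of sleeping.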

4. Linting

This makes me wonder about some kind of review-level lints that only apply to new code being added. Ideally we'd flag any code which accesses variables assigned from signal handlers and calls one of these functions, in a loop.

Here's my attempt at partially doing this with coccinelle:

@@
sig_atomic_t signal_handler_variable;
@@
*   signal_handler_variable
    ...
*   \(pause\|poll\|select\)(...)

@@
sig_atomic_t signal_handler_variable;
@@
*   \(pause\|poll\|select\)(...)
    ...
*   signal_handler_variable

This will match any function that contains access to a sig_atomic_t and a call to pause, poll, or select. If you save that as lint-pause.cocci, you can check code with

spatch --very-quiet --sp-file lint-pause.cocci path/to/c/files

Note that I am just using * to print out match cases for brevity, but you can add Python scripts to coccinelle rules for much prettier/more elaborate reporting.

It's possible to do much more careful matching, like ensuring the poll calls happen in a loop, or only matching polls with no timeout, but this simple form is sufficient to catch interesting cases to examine later. I also discovered while writing a more elaborate form that do {} while matching was only merged last year and distributions tend to carry older versions of spatch.

Note that it only catches use of sig_atomic_t; while testing this, I found some old code that just doesn't even use volatile; at some point I may write a more elaborate script that flags all globals set from any function passed to sigaction, but as a review reminder, this simpler form suffices for my needs.

5. Conclusion

dash had a bug of this flavor, and fixed it with sigsuspend;
busybox ash and hush acknowledge that they have this bug.

I notice that Kerrisk's LPI talks about pause in section 20.14, but doesn't note its perils, except to indicate that other ways will be investigated in section 22.9. There, Kerrisk introduces sigsuspend and talks about exactly our problem with pause.

glibc's info page talks about this extensively, so it's unfortunate that, for example, Linux's man page for pause(2) contains no such details.

Running the aforementioned coccinelle script across an arbitrary corpus of packages I have on hand turns up a number of likely instances of this bug, so this is still an issue worth keeping in mind.

Footnotes:

¹

Note that this program was designed to only handle a single signal. If that handler was registered for multiple signals, you could have races around what value got gets. It used to be you could have got be a mask – e.g. got |= 1<<n; – except these days SIGRTMAX and _SIG_MAXSIG can be way higher than whatever the width of sig_atomic_t is, so I guess people end up doing an array of sig_atomic_t which is maybe 64 times less dense than you'd like. :-/

worried by wordexp(3)

2023-11-21T03:30:00+0000

The function wordexp(3) is a POSIX C standard library function which performs "word expansion like a POSIX shell". wordexp(3) combines the safety of elaborate string parsing in C with the efficiency and robustness of invoking the shell on arbitrary user input. Why does it even exist? And why shouldn't you use it?

1. usage

Probably the most legit use is by init-style programs executing a command line (e.g. finit); though, since many wordexp implementations invoke the shell anyway, these might as well exec sh -c 'exec ...' instead.

Applications typically use wordexp to expand tildes and globs (~/*.txt), and are oblivious to its excessive powers. Mostly, these uses are in places like configuration files the user directly controls, so any disasters as a consequence of wordexp can be considered the user's fault.

More severe are the cases where wordexp's input comes from an untrusted source¹ or the program in question is setuid². Sometimes people use it when parsing files, even (e.g. this tinygltf issue ended up affecting blender).³

I continue to find code copied from Stack Overflow where it remains recommended, despite it being unsafe, and probably slow, too.

All this would just make wordexp seem like just another call like popen or system where it's obvious that you're opening Pandora's box, except wordexp has a flag, WRDE_NOCMD, intended to prevent the worst abuses of it. The existence of this flag is a mistake, because almost no libc actually tries to make it consistently safe. This flag may imply to people that wordexp is ever safe to use on untrusted input. However WRDE_NOCMD is effectively broken depending on the combination of shell and libc in use.

2. a central problem

Command substitution has two forms in shell: backtick-delimited (`command`) and dollar-parenthesized ($(command)). The former presents more problems for the user, but fewer for the author of the parser: simply scan ahead for a matching backtick, obeying other escaping and quoting.

The latter form is often where first-time shell writers fall into despair. It may be nested, and contain content-dependent unbalanced parentheses ⁴. The lexer must mutually-recursively invoke the parser just to find the end of the token.

The POSIX standard is ambiguous about a concerning detail of shell syntax: the difference between command and arithmetic substitution. Don't expend too much effort trying to understand this:

If the current character is an unquoted '$' or '`', the shell shall identify the start of any candidates for parameter expansion, command substitution, or arithmetic expansion from their introductory unquoted character sequences: '$' or "${", "$(" or '`', and "$((", respectively. The shell shall read sufficient input to determine the end of the unit to be expanded (as explained in the cited sections). While processing the characters, if instances of expansions or quoting are found nested within the substitution, the shell shall recursively process them in the manner specified for the construct that is found. The characters found from the beginning of the substitution to its end, allowing for any recursion necessary to recognize embedded constructs, shall be included unmodified in the result token, including any embedded or enclosing substitution operators or quotes. The token shall not be delimited by the end of the substitution.

— POSIX Shell Command Language, 2.3 Token Recognition

Each shell and wordexp implementation has its own take on how to deal with ambiguous expressions like these:

$((echo a);(echo b))
$((echo "(");(echo ")"))
$((case a in *) echo b;; esac))

An actual shell needs to recursively parse these expressions, whereas most wordexp implementations try to simply match parentheses. The last example in particular offers an opportunity to introduce arbitrary parentheses.
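To illustrate why matching parentheses isn't enough, here is a deliberately naive checker of the kind described. It is a hypothetical sketch, not the code of any particular libc; the name naive_has_cmdsub is made up, and it ignores quoting and escaping entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, deliberately naive check: returns 1 if s appears to
 * contain a command substitution, 0 otherwise. Anything starting with
 * "$((" is assumed to be arithmetic and skipped by counting parens. */
int naive_has_cmdsub(const char *s)
{
    assert(s != NULL);
    for (size_t i = 0; s[i]; i++) {
        if (s[i] == '`')
            return 1;
        if (s[i] != '$' || s[i + 1] != '(')
            continue;
        if (s[i + 2] != '(')
            return 1;               /* plain $( ... ) */

        /* "$((": skip ahead until both parentheses are closed again. */
        int depth = 2;
        i += 3;
        while (s[i] && depth > 0) {
            if (s[i] == '`')
                return 1;
            if (s[i] == '$' && s[i + 1] == '(' && s[i + 2] != '(')
                return 1;           /* nested $( ... ) */
            if (s[i] == '(')
                depth++;
            else if (s[i] == ')')
                depth--;
            i++;
        }
        if (depth > 0)
            return 1;               /* unbalanced: refuse */
        i--;                        /* the for loop advances past ')' */
    }
    return 0;
}
```

This checker rejects $(echo a) but accepts $((echo a);(echo b)), since the parentheses balance and nothing inside looks like a substitution. Going by the table below, bash then runs both commands.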

The POSIX rationale does say:

Arithmetic expansions have precedence over command substitutions. That is, if the shell can parse an expansion beginning with "$((" as an arithmetic expansion then it will do so. It will only parse the expansion as a command substitution (that starts with a subshell) if it determines that it cannot parse the expansion as an arithmetic expansion. If the syntax is valid for neither type of expansion, then it is unspecified what kind of syntax error the shell reports.

— POSIX Rationale for Shell and Utilities, 2.6

But who's reading that? Clearly not the authors of most popular shells:

shell	`$((echo a);(echo b))`	`$((echo "(");(echo ")"))`	`$((case a in *) echo b;; esac))`
dash	expects )	parse error	parse error
bash	`a b`	`( )`	syntax error
zsh	`a b`	bad math expression	`b`
mksh	`a b`	`( )`	`b`
hush	expects )	syntax error	syntax error

And while perhaps zsh is unlikely to ever be /bin/sh, all of the others are reasonable candidates for it that I've seen on other systems.

3. implementations

Now that we know shells handle these expressions inconsistently, how do different libcs implement wordexp?

3.1. OpenBSD

OpenBSD has the best possible implementation of wordexp(3): none. The demerits of the function are discussed in this thread from 2010.

3.2. leveraging the shell

There are implementations which try to keep wordexp simple by shelling out, which was probably the intended behavior when the function was first created. Unfortunately, this means WRDE_NOCMD can't be trusted in these libcs, without the direct assistance of the shell.

3.2.1. musl

musl's implementation is nice and simple, because it shells out. Unfortunately, in this simplicity, there's no way to really enforce WRDE_NOCMD. musl tries, by matching parentheses, but $((echo a);(echo b)) or similar will get around it, as long as /bin/sh supports such contortions (this confuses dash, but bash happily runs these commands).

3.2.2. Apple / FreeBSD

FreeBSD added wordexp in 2002, using the problematic shell-invoking approach. To FreeBSD's credit, they fixed the major issues with this approach circa 2015 by specializing the shell, indeed noting:

Shell syntax is too complicated to detect command substitution and unquoted operators reliably without implementing much of sh's parser. Therefore, have sh do this detection.

macOS has inherited versions of this implementation, with some modifications (1044.1.2, 1534.81.1). An important practical difference, though, is that on FreeBSD, /bin/sh is always their ash, which at the least doesn't suffer from the aforementioned parsing problem, while on macOS /bin/sh has been bash.⁵

One interesting twist of Apple's implementation is that in the past, they shelled out to perl, via popen. Last time I checked, they use a helper called /usr/lib/system/wordexp, but this did nothing to prevent command substitution – Apple's libc suffered the same problem as musl. (The shell situation on macOS is always evolving in interesting ways so who knows what the state is now.)

3.2.3. Solaris / Illumos

The Solaris implementation (originally from MKS with a copyright of 1985!) is notable for implementing WRDE_NOCMD by leveraging ksh's restricted mode.

Not many people may still be using this implementation, but this is pretty clever and I guess demonstrates that the commercial Unix implementations may not have been bad.

3.3. parsing shell syntax

So, instead of being simple, we can try the herculean task of implementing most of a shell in libc instead. The only libc I know of that does this is glibc, though (often old, broken) copies of its implementation are found widely, both in programs trying to get around systems like OpenBSD as well as other libcs. For example, uclibc has an old version of glibc's wordexp which has the fatal flaw that backtick can still be parsed from arithmetic expressions, so e.g. $[`touch foo`] will execute a command.

glibc avoids calling out to the shell except when it must, for command substitution. This results in an elaborate, 2500-line reimplementation of some of a shell parser, including a single 800-line function to parse parameter expansions.

It omits many details; for example, in arithmetic expansion. Many valid (and useful) expressions like $((1<<16)) will not be recognized.

However, it has one great merit. In 2014, Carlos O'Donell fixed an important vulnerability. Previously, the code tried to enforce WRDE_NOCMD only when it recognized command substitution's two forms, like many of the implementations which lean on the shell for everything. Though it seems obvious in retrospect, glibc's current implementation guards the actual execution of commands with WRDE_NOCMD tests, instead of trying to do this during parsing.

Despite its complexity, it does seem like the only implementation that is safe to use, though it seems a better policy is to declare a ban on wordexp.

4. conclusion

I had never heard of wordexp(3) until I saw it mentioned in the POSIX standard, while I was implementing a shell.

Many programs that just use wordexp("~/foo") could be replaced with glob(..., GLOB_TILDE). (Though keep in mind, anyone can crash your program with a bad glob.)
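As a rough sketch of that replacement: GLOB_TILDE is an extension rather than part of POSIX, so this assumes a libc that provides it (glibc and musl do). The helper name first_tilde_match is made up for this example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <glob.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: expand a leading ~ (and any glob characters) in
 * pattern and return a malloc'd copy of the first match, or NULL if
 * nothing matched or an error occurred. The caller frees the result. */
char *first_tilde_match(const char *pattern)
{
    glob_t g;
    char *result = NULL;

    assert(pattern != NULL);
    if (glob(pattern, GLOB_TILDE, NULL, &g) == 0 && g.gl_pathc > 0)
        result = strdup(g.gl_pathv[0]);
    globfree(&g);
    return result;
}
```

Unlike wordexp, glob only returns paths that exist, so first_tilde_match("~/foo") gives NULL when there is no such file. That difference matters if the program wanted the expanded name in order to create the file.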

I'm not sure where it was first implemented; scanning the Unix history repo, it doesn't really appear until the 90s but there's a reference in a manpage from FreeBSD 1.0 (which doesn't implement it). Let's hope it fades into history similarly.

(Post-publishing addendum: a friend alerted me to the POSIX rationale for wordexp (which I should have read to begin with). It seems the problem was that people kept demanding more and more features for glob, so the committee threw up their hands and added the far-too-broad wordexp in an attempt to cover all possible bases.)

Footnotes:

¹

jailkit uses wordexp in its restricted shell; to be fair, this is only enabled if you use the allow_word_expansion option which is disabled by default, and there's always the chance it will link with a safer wordexp like in modern glibc. However, it ships with a vulnerable wordexp bundled, for systems without it, and for example jk_lsh -c '$[`touch foo`]' will do bad things in such a case.

²

sudo has some clever logic to stuff the WRDE_NOCMD flag into calls to wordexp in its noexec mode; unfortunately, as demonstrated in this post, that's not sufficient to prevent execution from happening in all cases.

³

Debian's excellent codesearch service provides a quick way to find calls: https://codesearch.debian.net/search?q=%5Cbwordexp%5C%28%5Cb&literal=0&page=1

⁴

Consider this example:

$(case $(case $x in *) (echo $x);; esac) in (x) $(echo :);; *) $(echo :);; esac)

⁵

bash has a (disabled by default at compile time) --wordexp option, which is an attempt to provide this kind of functionality more safely. It tries to disable command substitution everywhere and only invokes the parser and expander immediately. Last time I checked, macOS didn't use this.

Some talks

2017-11-02T02:30:00+0000

Over the course of 2016 and 2017, I gave a few talks in public. I wanted to link to all of them, in part because it forces me to release the slides and code associated with them, but also to say a little about the value of public speaking and putting oneself out there.

Like many people, I have a hard time putting myself in front of people and exposing myself to criticism. (I've also spent a lot of my life being intensely private and avoiding visibility of any kind.) I often feel an obligation to perfection, and I came to realize that I don't hold other people to this same standard; I can forgive well-intentioned mistakes in others, so why not myself?¹

I came to realize that it was impossible for me to get everything right every time, and that as long as I graciously accept corrections, and try my best, I'm still providing value. In particular, I have often discovered that a lot of things I consider well-known are exciting and new to a lot of other people. Also, I just plain enjoy telling people about things I find exciting.

This is a motivating factor behind blogging for me, even as I'm acutely aware of the imperfections that necessarily come with these posts. (It also helps that the audience of this blog is vanishingly tiny, so I don't feel bad about expressing myself in my habitual tangly, verbose, parenthetical style.)

I hope that, then, reading this, you'll consider speaking on something you find interesting. You don't have to be an expert: your own passion plus diligent preparation will do fine.

A good way to get started is to speak at a local meetup (if you're in or visiting Montreal, please consider speaking at Papers We Love Montreal (PWLMTL)!). After that, why not submit proposals to conferences you'd like to go to anyway? You have nothing to lose.

I should mention that !!Con might be my favorite conference I've ever gone to, so far, and so I encourage you to submit a talk next year.

1. What if your NIF goes adrift? (Erlang Factory 2016)

(Slides; slides source coming when I can find it)

I was working a lot with Erlang NIFs, doing a lot of debugging and benchmarking, and ended up building a tool to make some parts of this easier (niffy). This talk was both an introduction to niffy and an opportunity to share some tips and warnings from that experience.

2. Think Outside the VM: Unobtrusive Performance Measurement (Erlang User Conference 2016)

(Slides; slides source)

This talk was about something I'm still passionate about: safely inspecting applications in production, in particular to diagnose performance issues (which can be hard to reproduce outside of production environments).

Like the previous talk, this was partially driven by extrospect-beam, a tool I had been working on, and having the talk coming up was good motivation to make a release. (And this post is a good reminder for me to pick up that work again and make the tool much easier to use, and perhaps modernize it with some eBPF approaches that have become available.)

3. Implementations of Timing Wheels (Systems We Love 2017, PWLMTL 2016/11)

(Slides source)

I was thinking a lot about expiry and rotation in many different contexts, and got very interested in timing wheels. I gave a talk that was something like an hour and a half at Papers We Love, and then somehow condensed it into twenty minutes for SWL. Needless to say this compression resulted in some loss, and you can see I'm a bit agitated in the video, mostly because of the time limit; my earlier attempts to give the cut down version had still been forty minutes long.

4. the Emoji that Killed Chrome (!!Con 2017)

(Slides, bug report)

If twenty minutes is hard, ten minutes seemed impossible. But this was a fun little debugging story and I ended up very glad I made this presentation. I think I fail to make the essential connection to surrogate pairs in this talk, which is unfortunate, as this is a key part of why this isn't valid UTF-8. (See these entries in the Unicode FAQ for a start.) The audience remained very supportive however, which helped a lot.

5. Simple Fast Algorithms for the Editing Distance between Trees & Related Problems (PWLMTL 2017/09)

No slides, since I like doing these talks on a whiteboard, but all the papers cited in the comments of the event page (which I can't seem to link directly to, but which live at the bottom) are a pretty good read. Here's the code that was presented: (looks like it could use some refactoring)

There's no video for this talk, because we don't record talks at PWLMTL. I get questions about this a lot, and wanted to say a few words about it.

5.1. Why PWLMTL doesn't record talks, and allows interruptions

Normally, questions in conference talks are kind of awful. You're on a strict schedule, so interruptions and digressions are just going to derail the speaker, and most of the questions at the end are usually more "well, actually" than elucidating.

I wanted to create an environment unlike that, where people, both attendees and presenter, would feel comfortable asking "dumb" questions, making mistakes, and exploring digressions. I know that if a talk is being recorded, I am much less comfortable asking a question.

So this helps PWLMTL function more like a discussion group than a conference talk, and I think this is valuable. I wish there were more things like this, but unfortunately they inherently don't scale. (Both in number of attendees, and in the number of these talks you can have in a day: PWLMTL benefits from a very flexible schedule, where talks can go from forty-five minutes to three hours long, depending on the stamina of the audience.)

(This still doesn't solve the problem that many people aren't comfortable speaking up, but I think trying to keep the environment as encouraging to speaking up as possible, as well as having a moderator facilitating the conversation (and curtailing those who are talking a little too much…), is better than the alternatives. I can't stand talks where I feel like the message is "don't ask questions"; why should I even attend?)

6. Conclusion

Hopefully you enjoyed at least one of these talks. If not, that's ok; I think my future talks will improve.

If you think even speaking at a meetup group is intractable, how about speaking to a small group of your peers? Many companies have lunchtime or Friday afternoon activities where you can give a talk or lead a workshop; and if yours doesn't, you could start one. (The same applies if you're a student, of course.) My experience is that these things are greatly appreciated and become an important part of a company's culture.

Footnotes:

Incidentally, if this really resonates with you, I can highly recommend Jeff Szymanski's the Perfectionist's Handbook; unlike many similar books, it doesn't simply urge you to lower your standards, and has much practical advice.

Fixie tries

2017-10-29T02:30:00+0000

tl;dr: Here's a trie you probably don't want to use, but you might find interesting: an x86-64-specific popcount-array radix trie for fixed-length keys. The code (in Rust) is on GitHub.

In which I discuss a slightly-hackish minor trie variant, the fixie trie. There are already so many kinds of tries¹: PATRICIA, Judy array, hash array mapped trie (HAMT), crit-bit, qp-trie, poptrie, TRASH, adaptive radix trie (ART), …

This is without even getting into suffix tries and so on. There are almost as many trie variants as there are heap variants. (In a way, hierarchical timing wheels are also a kind of trie structure.²)

So although I'm giving this one a new name, it probably already exists in some form. My apologies to whoever's named variant I may have stepped on. Where did yet another trie variant come from?

1. Crit-Bit Tr[ei]es

When I first stumbled on Dan Bernstein's description of crit-bit trees, I got really excited.³ He paints a pretty compelling picture: a data structure simpler, faster, smaller, and more featureful than hash tables. Who needs these hash tables anyway?

Of course, it turned out that in practice, at least for the critbit trees described by djb, all that pointer-chasing wasn't great, and it's not straightforward to adapt to types where you don't have a natural sentinel like NUL.

Then I found Tony Finch's qp-tries which are even better, but first let's go through HAMT.

2. array mapped tries

Marek's introduction to HAMT is better at explaining this than I am, but briefly: there's this great trick you can do if you have a fast way to count the 1s in a bitmap (this operation is called population count, or "popcount" for short).⁴

Let's say you want a map from the integers [0,8) to elements in an array. This is such a small number that you could just allocate eight elements and your "map" is just array indexing, but let's pretend eight is a larger number, since I didn't want to draw a bunch of sixty-four-element-wide diagrams in this article.

If you wanted to only allocate space for the elements that are present in this map, you could use an 8-bit bitmap to indicate presence, and store the elements in order. Now, popcount of the whole bitmap gives you the number of elements present; the individual bits tell you a given element is present; and the popcount of the bitmap masked by the bits lower than that element gives you its index in the array.
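To make the trick concrete, here's a minimal sketch in Rust of that eight-slot map. The TinyMap type and its method names are made up for this illustration (this is not the code from the fixie trie crate); count_ones is Rust's standard popcount.

```rust
/// A sparse map from the integers [0, 8) to values that only stores the
/// elements that are present. (Hypothetical type, for illustration only.)
pub struct TinyMap<T> {
    bitmap: u8,       // bit i is set iff key i is present
    elements: Vec<T>, // the present elements, in key order
}

impl<T> TinyMap<T> {
    pub fn new() -> Self {
        TinyMap { bitmap: 0, elements: Vec::new() }
    }

    /// Position of `key` in `elements`: the popcount of the bitmap
    /// masked by the bits lower than `key`.
    fn index_of(&self, key: u8) -> usize {
        let lower_bits = (1u8 << key) - 1;
        (self.bitmap & lower_bits).count_ones() as usize
    }

    /// Number of elements present: the popcount of the whole bitmap.
    pub fn len(&self) -> usize {
        self.bitmap.count_ones() as usize
    }

    pub fn get(&self, key: u8) -> Option<&T> {
        assert!(key < 8);
        if self.bitmap & (1u8 << key) == 0 {
            return None; // the presence bit is clear
        }
        self.elements.get(self.index_of(key))
    }

    /// Insert `value` at `key`, returning the previous value if any.
    pub fn insert(&mut self, key: u8, value: T) -> Option<T> {
        assert!(key < 8);
        let i = self.index_of(key);
        if self.bitmap & (1u8 << key) != 0 {
            Some(std::mem::replace(&mut self.elements[i], value))
        } else {
            self.bitmap |= 1u8 << key;
            self.elements.insert(i, value);
            None
        }
    }
}
```

Inserting key 5 and then key 2 leaves the elements stored in key order, so key 2 sits at index 0 and key 5 at index 1, with no space spent on the six absent keys.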

Phil Bagwell presents this idea in Fast and Efficient Trie Searches to yield array mapped tries (AMT)⁵. Later, in Ideal Hash Tries, he builds on these to yield hash array mapped tries (HAMT), which have become widely implemented.⁶,⁷

This popcount trick appears in all sorts of other places, for example in unrelated travels I just ran into it in the Caroline Word Graph as a consequence of reading Appel and Jacobson's The World's Fastest Scrabble Program. (Which I now notice is also cited by Bagwell in the aforementioned paper.)

(BTW, Bagwell's other papers, such as Fast Functional Lists, are also well worth reading!)

3. qp-tries

Now, qp-tries sort of combine HAMT and crit-bit trees; they work with larger chunks instead of a bit at a time; the original description was a nibble at a time, so a 64-bit word uses 16 bits for a bitmap representing possible nibbles, and the rest of the word stores the index of the nibble being tested (the "critical nibble").

I found qp-tries inspiring in part because of the lucid implementation which yielded great performance with simple code. Check out qp.h for a nice overview in the comments, and qp.c for the implementation.

qp-tries are great for lots of cases, especially strings; I wanted to go on about them but in general I suggest just checking out everything on the qp-trie homepage.

4. canonical pointers on x86-64

For whatever reason, this never came to me when I encountered HAMT, but my ears perked up when I heard 16, 48, 64, in the description of qp-tries. In practice, x86-64 machines only use 48 of the bits in a pointer (64 bits); what about the other 16? Intel has the following to say about addresses: (SDM volume 1, 3.3.7.1 Canonical Addressing)

In 64-bit mode, an address is considered to be in canonical form if address bits 63 through to the most-significant implemented bit by the microarchitecture are set to either all ones or all zeros.

Intel 64 architecture defines a 64-bit linear address. Implementations can support less. The first implementation of IA-32 processors with Intel 64 architecture supports a 48-bit linear address. This means a canonical address must have bits 63 through 48 set to zeros or ones (depending on whether bit 47 is a zero or one).

Although implementations may not use all 64 bits of the linear address, they should check bits 63 through the most-significant implemented bit to see if the address is in canonical form. If a linear-memory reference is not in canonical form, the implementation should generate an exception.

(Note this means that pointers are in a sense "signed" on this platform!)

So, we get no guarantees that we can always use those upper 16 bits, but this is the case presently⁸: they must always be all zeros, or all ones. And, at least on Linux, the kernel uses the "negative" addresses (whose 16-bit prefix is all ones), so our userspace pointers are always zero there. What a tempting place to stuff some extra data!
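As a sketch of what the 48-bit rule means in practice, here is a check for canonical form written with plain integer arithmetic. The function names are mine, and the 48-bit width is hard-coded as an assumption; it would be wrong on hardware that implements more address bits.

```rust
/// Returns true if `addr` is canonical for a 48-bit implementation,
/// that is, if bits 63 through 48 are all copies of bit 47.
fn is_canonical_48(addr: u64) -> bool {
    // Move bit 47 up to the sign position, then shift back down as a
    // signed value, which sign-extends from bit 47.
    let sign_extended = (((addr << 16) as i64) >> 16) as u64;
    sign_extended == addr
}

/// The 16-bit prefix that canonical form forces to all zeros or all ones.
fn upper_16(addr: u64) -> u16 {
    (addr >> 48) as u16
}
```

Under this check, 0x0000_7fff_ffff_ffff and 0xffff_8000_0000_0000 are canonical, while 0x0000_8000_0000_0000 is not: its bit 47 is set but its prefix is all zeros.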

5. fixie tries

Then, earlier this year, I was laying the groundwork for potentially developing a service in Rust at a company where previously such services had been written in C. I wanted to write some code that would prove to me that it wouldn't be too painful to drop to the level of pointers, bits, and syscalls that we needed. I started with the magic ringbuffer, and then a popcount-based tiny compact map.⁹

Finally, I needed a structure where I could space-efficiently map fixed-length integers to values. What if we gave up critbit's nice property of compressing paths where the bits are the same, and reconstructed the key by traversing the trie? Could we stuff that bitmap into the unused bits, and end up with a single word per branch?

That's basically fixie tries: we use the lowest bit on a pointer to determine if it's a branch or a leaf (since pointers must be word aligned, we actually have a few free bits at the bottom). If it's a branch, we cut off its sign extension and put a bitmap there; the pointer that results from masking the branch with 0x0000_ffff_ffff_fffe points to an array containing the children indicated by the bitmap.

If a leaf is at the deepest level in the trie, it just points directly to a value, since we can reconstruct the key as we walk to it. If the leaf is somewhere shallower, we store a tuple of the full key and the value; there are probably improvements to be made there, but the hassles of aligning a partially-nibbled key have kept me away from them.
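Here's a rough sketch of how such a branch word could be packed and unpacked. It works on plain u64 values rather than real pointers, and it makes two assumptions the text above doesn't settle: that a set low bit means "branch", and that the 16-bit bitmap covers one nibble of the key per level. The function names are hypothetical, not the crate's.

```rust
/// Mask from the text: drops the 16-bit prefix and the low tag bit.
const POINTER_MASK: u64 = 0x0000_ffff_ffff_fffe;
/// Assumption: a set low bit marks a branch, a clear one marks a leaf.
const BRANCH_TAG: u64 = 1;

/// Pack a child bitmap and the address of the child array into one word.
/// `addr` must have its upper 16 bits and its lowest bit clear, so the
/// bits we overwrite carry no information.
fn pack_branch(bitmap: u16, addr: u64) -> u64 {
    assert_eq!(addr & !POINTER_MASK, 0, "need zero upper 16 bits and low bit");
    (u64::from(bitmap) << 48) | addr | BRANCH_TAG
}

fn is_branch(word: u64) -> bool {
    word & BRANCH_TAG != 0
}

/// The bitmap stored where the sign extension used to be.
fn branch_bitmap(word: u64) -> u16 {
    (word >> 48) as u16
}

/// The address of the child array, with bitmap and tag masked off.
fn branch_children_addr(word: u64) -> u64 {
    word & POINTER_MASK
}

/// Index into the child array for `nibble`, if the bitmap has it.
fn child_index(word: u64, nibble: u8) -> Option<usize> {
    assert!(nibble < 16);
    let bitmap = branch_bitmap(word);
    let bit = 1u16 << nibble;
    if bitmap & bit == 0 {
        None
    } else {
        Some((bitmap & (bit - 1)).count_ones() as usize)
    }
}
```

The child lookup is the same popcount trick as in array mapped tries; the only new part is where the bitmap lives.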

To be clear, though qp-tries were what sparked the flame, these are more like Bagwell's array mapped tries. They're x86-64-specific¹⁰, in a way that you probably don't want to depend on, and they're not magically better than everything else, but they have a few nice properties.

6. Benchmarking and performance

Sadly I mostly had a puny AMD E-350 to benchmark this on, but thankfully a friend ran it on a machine identifying itself as packing 16 × Intel(R) Core(TM) i7-7820X CPU @ 3.60GHz.

Random insertions of 32-bit keys as a set: (it would be nice to add Roaring or similar to this benchmark)

test random_u32_insertions_on_hash_set          ... bench:         134 ns/iter (+/- 139)
test random_u32_insertions_on_fixie_trie_as_set ... bench:         293 ns/iter (+/- 26)
test random_u32_insertions_on_btree_set         ... bench:         343 ns/iter (+/- 44)

Random insertions of 64-bit keys and 32-bit values:

test random_u64_insertions_on_hash_map          ... bench:         154 ns/iter (+/- 158)
test random_u64_insertions_on_fixie_trie_as_map ... bench:         266 ns/iter (+/- 32)
test random_u64_insertions_on_btree_map         ... bench:         435 ns/iter (+/- 44)

One thing that's part of djb's argument for critbit trees that remains true for fixie tries is relatively uniform performance. We can see that hash inserts are fast, but have a lot of variance.

Random queries on 32-bit keys, acting as a set:

test u32_random_queries_on_btree_set               ... bench:           5 ns/iter (+/- 0)
test u32_random_queries_on_fixie_trie_set          ... bench:          22 ns/iter (+/- 0)
test u32_random_queries_on_hash_set                ... bench:          29 ns/iter (+/- 0)
test u32_random_queries_on_a_qp_trie_set           ... bench:          59 ns/iter (+/- 0)

I was surprised by how good BTreeSet is here; some of my early benchmarking (at the beginning of the year) indicated otherwise, but it (or my methodology) must have improved.

Random queries on 64-bit keys for 32-bit values:

test u64_random_queries_on_fixie_trie              ... bench:          23 ns/iter (+/- 0)
test u64_random_queries_on_hash_map                ... bench:          31 ns/iter (+/- 0)
test u64_random_queries_on_btree_map               ... bench:          93 ns/iter (+/- 1)

BTreeMap starts to fall behind as the key size grows here.

Memory usage (maximum RSS of process) after random insert of n 64-bit keys associated with 32-bit values:

	10000	100000	1000000	10000000
`btree_map`	2640 kB	4776 kB	25900 kB	237000 kB
`fixie_trie`	2852 kB	5508 kB	31848 kB	286212 kB
`hash_map`	3272 kB	8596 kB	76208 kB	592096 kB
`qp_trie`	3180 kB	10944 kB	93796 kB	809116 kB

As we can see, BTreeMap is actually quite compact, which I didn't expect; since we're measuring the whole process's usage, this takes into account allocator fragmentation and so on. It's likely that with a custom allocator, fixie tries would be much more compact.

Overall, despite still having a lot of relatively low-hanging optimization opportunities, fixie tries do ok here: they use much less memory than hash maps and qp-tries, and have consistent performance for inserts and queries that is never the slowest of the alternatives tested.

7. Tangent: benchmarking caveats

Aside from the regular disclaimers, here's something specific to how this benchmark is written. What's wrong with this approach?

let mut rng = rand::thread_rng();
let mut t = FixieTrie::new();
b.iter(|| {
    for _ in 0..N_INSERTS { t.insert(rng.gen::<u32>(), ()); }
});

It makes fixie tries look great compared to BTreeSets. You should always scrutinize a benchmark, especially when it surprises you, but most of all when it confirms your own beliefs (or desires). This is when it is most tempting to stop looking and publish your results. (This makes it a rhetorical benchmark.)

We don't recreate the trie every time in this test, so the trie gets progressively fuller as the iterations continue. Maybe fixie tries are good in this situation, but it's not what we meant to test, and isn't likely to give us much accuracy since the real workload will change every iteration.

Unfortunately, cargo bench doesn't give us the ability to do any kind of setup/teardown between iterations, so we're forced to use patterns like this:

fn random_insertions_on_a_set<T: Rand, S: Set<T>>(b: &mut Bencher, new: fn () -> S)
{
    let mut rng = rand::thread_rng();
    let mut s = new();
    for _ in 0..N_INSERTS { s.insert(rng.gen()); }
    b.iter(|| { s.insert(rng.gen()); });
}

I can't fault cargo bench too much here, because it does so much right, out of the box: it warms up the code under test before trying to measure, and uses somewhat robust statistical measures to determine when it's safe to stop (rather than using the mean and standard deviation as you'll see in a lot of benchmarks). Also, it takes a lot of fiddly, system-specific code to be able to measure a single iteration with accuracy, so it would be difficult for it to provide this interface.

But since the underlying structure is changing on every iteration, we may not be measuring what we think we're measuring. At least filling it partially before starting the timing puts the inserts-under-measurement in more consistent territory.

8. Tangent: debugging a jemalloc deadlock

After updating this crate to use the latest Rust allocator API, some tests would hang. Digging in with gdb revealed they were deadlocked, trying to lock a mutex they already held. Although I was 99% sure this was a bug in my code, jemalloc has had its fair share of deadlock bugs in the past so I held a glimmer of hope I could find something really interesting.

The tests didn't hang with the system allocator and valgrind saw nothing wrong. So I dug into the code and learned a lot about jemalloc's internals; especially I learned that there's a lot of tricky locking in the thread cache code, which surprises me since the point of the thread cache is to reduce synchronization overhead, but I guess I was looking at all the slow paths. I observed that modifying lg_tcache_max (the largest size-class kept in the thread cache) changed at what point it deadlocked, but not in a predictable way (making a table of values to progress, there was a general trend that the larger the classes allowed, the more progress it made, but not consistently).

One of the interesting things was how predictable the failure was. Thinking I might be doing something to corrupt jemalloc's structures badly enough to corrupt the lock, I scripted gdb to print the state of the lock at the points where it was locked and unlocked in the function where it deadlocks, which revealed that the locks looked fine in that function. Unfortunately, because everything was optimized out, it was hard to introspect the structures I thought might have been getting corrupted; I started building rust with a debug, assertion-laden jemalloc, but I knew it would take all night on my laptop.

Before I called it a night, I scripted gdb to print all the calls to mallocx, rallocx, and sdallocx, and their return values, and started putting together a sed script to transform this into a C program that made the same sequence of allocations and deallocations.¹¹

When I woke up the next morning, I realized there was something suspicious about the deallocations logged. I tested with the system allocator, but with various allocators LD_PRELOAD'd, including the same version of jemalloc that Rust was using; none of these hung. So I asked myself, what's the difference?

Of course, Rust is using the sized deallocation API (sdallocx), and the system allocator will be going through malloc and free and not passing any sizes. Looking again at how my library was calling dealloc, I spotted the bug instantly; I was claiming some things I was freeing were smaller than they actually were. Changing this fixed the bug. I took a look through the paths that sdallocx would take if given smaller sizes, and it looks like if my build of Rust with jemalloc assertions enabled had finished compiling, it would have detected my mistake, but otherwise, supplying the wrong size would cause havoc in the thread cache later.

You might say that I should have known to look at these things first, but such is the nature of bugs.
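For reference, the contract I violated looks like this when written against today's stable std::alloc functions. This is only a sketch: it is not the allocator API the crate used at the time, and the function name is made up. The point is that the layout given to dealloc has to be the one the memory was allocated with.

```rust
use std::alloc::{alloc, dealloc, handle_alloc_error, Layout};

/// Allocate room for `n` u64s, fill it with 0..n, sum it, and free it.
fn sum_via_raw_allocation(n: usize) -> u64 {
    assert!(n > 0); // `alloc` must not be called with a zero-sized layout
    let layout = Layout::array::<u64>(n).expect("layout overflow");
    unsafe {
        let p = alloc(layout) as *mut u64;
        if p.is_null() {
            handle_alloc_error(layout);
        }
        for i in 0..n {
            p.add(i).write(i as u64);
        }
        let mut total = 0u64;
        for i in 0..n {
            total += p.add(i).read();
        }
        // The layout here must match the one passed to `alloc`. Claiming
        // a smaller size is undefined behaviour, and is the kind of
        // mistake described above.
        dealloc(p as *mut u8, layout);
        total
    }
}
```

Keeping the Layout in one variable and reusing it for both calls is the simplest way to avoid computing the size twice and getting it wrong once.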

9. Tangent: property testing saves the day

I would have zero confidence in this if it weren't for property testing. Although not as full-featured as the libraries for some other languages, Rust's quickcheck is extremely easy to integrate into the development process, and found tons of bugs that my own, manually-crafted unit tests never would have turned up.

A nice property we have when making a data structure that obeys some common interface is that we can just test against a known-good structure that obeys the same interface:

#[derive(Copy, Clone, Debug)]
enum MapOperation<K: FixedLengthKey, V> {
    Insert(K,V), Remove(K), Query(K),
}

impl<K,V> Arbitrary for MapOperation<K,V>
    where K: Arbitrary + FixedLengthKey + Rand,
          V: Arbitrary + Rand {
    fn arbitrary<G: Gen>(g: &mut G) -> MapOperation<K,V> {
        use self::MapOperation::*;
        match g.gen_range(0,3) {
            0 => Insert(g.gen(), g.gen()),
            1 => Remove(g.gen()),
            2 => Query(g.gen()),
            _ => panic!()
        }
    }
}

quickcheck! {
    // A small keyspace makes it more likely that we'll randomly get
    // keys we've already used; it's easy to never test the insert
    // x/remove x path otherwise.
    fn equivalence_with_map(ops: Vec<MapOperation<u8,u64>>) -> bool {
        use self::MapOperation::*;
        let mut us = FixieTrie::new();
        let mut them = ::std::collections::btree_map::BTreeMap::new();
        for op in ops {
            match op {
                Insert(k, v) => { assert_eq!(us.insert(k, v), them.insert(k, v)) },
                Remove(k) => { assert_eq!(us.remove(&k), them.remove(&k)) },
                Query(k) => { assert_eq!(us.get(&k), them.get(&k)) },
            }
        }
        us.keys().zip(them.keys()).all(|(a,&b)| us.get(&a) == them.get(&b))
    }
}

10. Further directions

This allocates a lot, and buffering writes could help, as well as a custom allocator. Bagwell's papers talk about the importance of knowing the common allocation patterns and tailoring the allocator to them.

Dropping the trie is expensive right now. This is something that would be a lot faster with a hierarchical or region-based allocator, as long as the keys and values don't implement Drop; you would much rather throw the whole area away than having to walk it just to dispose of it, which is what I do in the current implementation.

We still have some unused bits in our branch structure. If prefix-compression ended up being useful, it could be stuffed into one of those bits. As it is, though, I like that the pointer structure is pretty simple.

Tony Finch has explored lots of qp-trie variants; some of the same ideas are also applicable to fixie tries. Prefetching made a noticeable difference for qp-tries.

There are some interesting concurrent tries in the same vein, including CTries, and to a lesser extent poptries. I thought about a reasonable scheme for lockless insertions in fixie tries, but haven't had time to implement it.

It might be nice to do something like direct pointing, also from poptries, where the first n bits are covered by a 2ⁿ array of tries. This is a little reminiscent of the two-level approach of Roaring bitmaps.
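A minimal sketch of the direct pointing idea for 32-bit keys, assuming nothing about how poptries actually lay this out: the top bits of the key pick a slot in a 2ⁿ table, and the remaining bits go to whatever structure lives in that slot. A BTreeSet stands in for the per-slot trie, and the type and method names are hypothetical.

```rust
use std::collections::BTreeSet;

/// The top `bits` bits of a u32 key index a table of 2^bits slots; each
/// slot holds the low bits of the keys that landed in it.
struct DirectPointed {
    bits: u32,
    slots: Vec<BTreeSet<u32>>, // stand-in for a table of tries
}

impl DirectPointed {
    fn new(bits: u32) -> Self {
        assert!(bits >= 1 && bits <= 16);
        DirectPointed { bits, slots: vec![BTreeSet::new(); 1usize << bits] }
    }

    /// Split a key into (slot index, remaining low bits).
    fn split(&self, key: u32) -> (usize, u32) {
        let shift = 32 - self.bits;
        ((key >> shift) as usize, key & ((1u32 << shift) - 1))
    }

    /// Returns true if the key was not already present.
    fn insert(&mut self, key: u32) -> bool {
        let (slot, rest) = self.split(key);
        self.slots[slot].insert(rest)
    }

    fn contains(&self, key: u32) -> bool {
        let (slot, rest) = self.split(key);
        self.slots[slot].contains(&rest)
    }
}
```

With n = 8, the key 0xdead_beef lands in slot 0xde and only 0x00ad_beef has to be stored below it, which removes two nibble-levels from every lookup at the cost of a 256-entry table.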

Finally, there are still a lot of interface niceties missing from this. For example, it really should implement Entry but I haven't gotten around to it yet.

Let me know if you end up using fixie tries or just end up taking inspiration from this world of trie techniques to build your own little trie variant.

Footnotes:

¹

First digression, pronunciation: ever since I found out that trie comes from the middle of the word re-trie-val, I have pronounced this word as "tree", confusing everyone. Paul E. Black agrees, but Knuth says "[a] trie — pronounced 'try'".

Usually all these naming prefixes (suffix trie, PATRICIA trie, qp-trie, fixie trie) are enough to distinguish them.

²

I gave a brief talk on timing wheels (and other ways to implement timers) at Systems We Love in Minneapolis this year.

³

I think Adam Langley's literate programming treatment of critbit trees is a nice way to explore them.

⁴

See also popcnt in SSE4, and Wojciech Muła's work doing popcount on larger bitmaps more efficiently.

⁵

And, it ends with a paragraph that has more significance to me since Are jump tables always fastest?:

Finally, its worth noting that case statements could be implemented using an adaptation of the AMT to give space efficient, optimized machine code for fast performance in sparse multi-way program switches.

⁶

For example, Erlang's not-quite-new-anymore map type is implemented as a HAMT, at least for more than 32 elements. This is why it's important to build your Erlang VM with -march=native.

⁷

I'd be remiss not to mention recent work (Optimizing Hash-Array Mapped Tries for Fast and Lean Immutable JVM Collections) that's been done on improving HAMT performance, as covered in The Morning Paper.

⁸

But as /u/1amzave on lobste.rs points out, 57-bit addressing is coming.

⁹

I coincidentally started reading Purity and Danger while writing all this unsafe code and thinking about the cultures of C and Rust.

¹⁰

Could we do this on another platform? If you use indexes into an array you control, or if you control the memory allocator and use a BIBOP-style approach where you know what all trie pointers will look like, you could probably do the same thing.

¹¹

Minimal repro, in case you too would like to explore the guts of jemalloc's locking code:

#include <jemalloc/jemalloc.h>

int main(void)
{
    void *p = mallocx(9, 0);
    sdallocx(p, 2, 0);
    enum { N = 38000 };
    void *q[N];
    for (int i = 0; i < N; ++i)
        q[i] = mallocx(8, 0);
    for (int i = 0; i < N; ++i)
        sdallocx(q[i], 8, 0);
    malloc_stats_print(NULL, NULL, NULL);
}

Building shells with a grain of salt

2017-10-17T02:30:00+0000

The shell is at the heart of Unix. It's the glue that makes all the little Unix tools work together so well. Understanding it sheds light on many of Unix's important ideas, and writing our own is the best path to that understanding.

Earlier this year, at a place I worked, I decided to run a series of workshops on writing a Unix shell. A lot of questions had come up that I think writing a shell leads you through, as well as issues that suggested tenuous mental models of the shell and its scripting language.

A small sampling of those kinds of questions:

why does this Python replacement for this shell script deadlock?
why are there so many Unicorn processes on this server?
when are variables quoted in a shell script? why aren't they always visible to other programs I run?
how does set -e work?
how does control-C (^C) work? why do I need to use ^\ sometimes instead? why doesn't ^C work the way I'd expect on this bash for loop?
what's special about init?
why do we do these steps in daemonization?

At this company, we had a regular Friday afternoon workshop/lecture series. I had previously tried to do an overview of Unix process relationships, but it felt too abstract. So, I tried to make it more concrete by getting everyone to actually implement a shell.

Initially, this was just a rough layout of what I thought I could cover in each session, and pointers to manpages. I never turned this into the full, DIY, self-paced tutorial I had hoped, but (in the spirit of release early, release often) I am opening up my work in progress at https://github.com/tokenrove/build-your-own-shell.

This isn't "finished", but if you're ambitious, you should be able to make something that passes all the tests. I decided it would be better to put it out there, even in rough form, than keep it sealed up. After all, a number of people enjoyed the workshop out of which this came.

(Caveat for macOS and *BSD users: there's still something wonky about the timing in the section that tests signals and job control; hopefully by the time you're reading this, I'll have it worked out, but if not, I apologize.)

In this post I'll reflect on some choices I made, and follow a few tangents that come up in the text but would be disruptive there.

1. Testing shells

I decided that, for this to be useful for self-study, it should contain an automated test suite. I love Tcl and expect, and had figured it would be a natural tool for testing the interactive components of shells. I took a quick look at how other shells were testing themselves. Most were strictly non-interactive tests, using shell scripts and comparing with expected output. A nice exception here is fish, which indeed uses expect for its interactive tests.

This makes sense, but I wanted to focus on interactive shells: in part because so many tutorials ignored the considerations of interactive shells, but also because I felt people would enjoy themselves more if they could use the shell they were writing directly.

I started with some tests edited from the output of autoexpect, but this turned out to be too fragile. Something I noticed in the first workshop was that people really enjoyed customizing their prompt; this should be no surprise (prompt customization is a perennial time-wasting activity in any shell), but it meant I'd have to be careful about how I matched outputs in tests. In particular, I couldn't really depend on detecting and matching the prompt.

The other tricky thing is that I couldn't use any feature in the tests that hadn't been developed yet, so using conditionals or echoing $? wasn't possible in the early tests.

I considered writing a wrapper using ptrace(2) that would watch for all fork / execve / wait syscalls from the shell and its children, and print those in a form easily consumed by a test harness (this seemed easier to do than cleaning up the output of strace), but things like prompts that exec git every time, as well as ptrace's noted stubbiness on macOS, prevented me from going further with this.

So that's where the workshop sat for a long time, until I finally decided to use a little test description language in place of expect scripts directly. So now a typical test might look like:

→ true || false || echo-rot13 foo⏎
≠ sbb
→ false || true && echo-rot13 foo⏎
← sbb
→ exit 42
☠ 42

For whatever reason, expressing things this way allowed me to finally write out all the tests I had intended to have, without focusing too much on the implementation of the test harness. Then I wrote some Tcl to interpret these files.

I decided to go with string matching in the output, which is not particularly robust, but is simple. Because of discrepancies between how different shells and TTY drivers draw things, it can be prone to matching the echoed input as the output if one isn't careful. There are also some timing issues; the script written by autoexpect suggests inserting a 100ms delay between each keystroke sent, but this makes the tests cripplingly slow; I'm still trying to find a tuning that is reliable across systems but speedy enough to be usable.

I decided that bash and mksh should pass all the tests, and cat should fail every test. There's nothing worse than a test that fails to actually test something. This reminds me of the admonishment "don't try to do what a corpse can do better": goals phrased in the negative (like "stop reading Hacker News") are hard to achieve — the dead (or cat) will always do them better than you. Positive goals (and tests) are more actionable.

There are still some timing issues on different platforms, but I don't regret making the simple choice for now.

2. Minimal shell builtins

Doing the workshop led me to think about minimizing shell builtins; one of the questions that comes up a lot is why cd needed to be a builtin, but what doesn't come up until one is much deeper into pipelines and job control is what a pain builtins are, in how they interact with the rest of the shell's features. It would be nice to get rid of them.

There are some commands which are builtins only to make them fast, like echo, true, and false. These usually have equivalents in /bin already.

Some builtins are required because they modify the shell's own environment: cd, exit, fg, bg, jobs, exec, wait, ulimit. (This is excluding really tricky, impractical things, like using shared memory, process_vm_writev, or ptrace to modify the shell from an outside process.)¹

To prove a point, you could take functional programming to an extreme and have an immutable shell where cd executes a new shell in the chosen directory, but some of the others are probably not possible in the presence of typical job control.²

If we take this line of thought further, we can try externalizing some of the shell's operators. Conditional execution is interesting. How about && and ||? Syntactically, we probably can't pull these off as external commands, but we could provide commands and and or which take commands to execute.
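As a rough sketch of the semantics such commands would need, here are and and or as functions over already-split command lines. Everything here is hypothetical: the names, the choice to pass the runner in as a closure, and the helper that spawns a real process. How the commands would be delimited on an actual command line is deliberately left out.

```rust
use std::process::Command;

/// Run one command line (program followed by its arguments) and report
/// whether it exited successfully. Failing to spawn counts as failure.
fn run(argv: &[&str]) -> bool {
    match argv.split_first() {
        Some((program, args)) => Command::new(program)
            .args(args)
            .status()
            .map(|status| status.success())
            .unwrap_or(false),
        None => false,
    }
}

/// `and`: run the commands in order, stopping at the first failure.
/// Succeeds only if every command succeeded.
fn and<F: FnMut(&[&str]) -> bool>(commands: &[Vec<&str>], mut run: F) -> bool {
    commands.iter().all(|c| run(c.as_slice()))
}

/// `or`: run the commands in order, stopping at the first success.
/// Succeeds if any command succeeded.
fn or<F: FnMut(&[&str]) -> bool>(commands: &[Vec<&str>], mut run: F) -> bool {
    commands.iter().any(|c| run(c.as_slice()))
}
```

Passing run as the closure gives the real behaviour; passing a fake makes the short-circuiting easy to check without spawning anything.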

Implementing if is an obvious next step from and and or. Now we can implement while, although we'd have to be careful about how we handle the environment if we wanted to handle many typical uses of while.

The for loop almost already exists in this form, as xargs. We would probably want to provide both a sequential version, where the environment for each iteration depends on the previous, and a parallel version where everything can run at the same time.

Note that most of these approaches require that you have mechanisms for escaping that aren't too cumbersome, for them to be practical. There seems to be a close parallel with macro facilities in languages like Lisp.

At the extreme side of cumbersome quoting would be case, which you'd probably want to take its input from a heredoc.

I was originally going to write a proof of concept of this (called "builtouts"), but researching this led me to the intriguing execline "shell", which has already done this, and explored this space rather nicely.

One thing that execline doesn't seem to do is implement something resembling real job control. If bg executes a command without waiting and then re-executes the shell with a suitable variable set (to the PGID of this job), the shell on each execution can check this variable to see what jobs are still alive; the jobs command can print the contents of this variable; the fg command just becomes tcsetpgrp and wait with the PGID of the current job. For an interactive shell, the tricky thing is probably making sure that bg's children don't end up in an orphaned process group.

A lot of these programs end up having to deal with quoting. Is there a way to take this further and handle quoting in its own program? For fixed-arity programs (like if), we can imagine an unquote helper that calls a subsidiary program with, first, the fixed remaining arguments, and then all of the original quoted argument, expanded, as the remaining arguments.

As glob(7) notes:

Long ago, in UNIX V6, there was a program /etc/glob that would expand wildcard patterns. Soon afterward this became a shell built-in.

Luckily, the source is available in Diomidis Spinellis's unix-history-repo, and we can see that it does this same kind of chain loading, executing its first argument with the rest of its arguments expanded according to the globbing rules.

I especially enjoy the extremely primitive path search and shell script support.
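For comparison, a modern sketch of the same trick can lean on glob(3). This is not the V6 code; expand_args is a hypothetical helper, and I'm assuming GLOB_NOCHECK is the behavior we want (a pattern that matches nothing is passed through unchanged, as most shells do).

```c
#define _POSIX_C_SOURCE 200809L
#include <glob.h>
#include <stdlib.h>
#include <string.h>

/* Build a NULL-terminated argv: argv[0] is kept as the command name,
 * argv[1..argc-1] are expanded as glob patterns.
 * The glob results are deliberately never freed: the caller is expected
 * to exec immediately. Returns NULL on failure. */
char **expand_args(int argc, char *const argv[]) {
    glob_t g;
    memset(&g, 0, sizeof g);
    for (int i = 1; i < argc; i++) {
        int flags = GLOB_NOCHECK | (i > 1 ? GLOB_APPEND : 0);
        if (glob(argv[i], flags, NULL, &g) != 0)
            return NULL;
    }
    char **out = calloc(g.gl_pathc + 2, sizeof *out);
    if (out == NULL) return NULL;
    out[0] = argv[0];
    for (size_t i = 0; i < g.gl_pathc; i++)
        out[i + 1] = g.gl_pathv[i];
    return out;                     /* out[gl_pathc + 1] is NULL thanks to calloc */
}
```

The chain-loading step is then a single execvp(out[0], out) on the result.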

3. Objet trouvé engineering

(Found object engineering, often called cargo cult programming.)

Now we get to the inflammatory bits, for those who kept reading.

Stack Overflow modernized, but did not create, the practice of assembling Frankenstein programs from poorly understood and imitated examples. I think no language has been more greatly affected by this than shell, as evidenced by the bizarre ready-made shell scripts one can encounter almost everywhere. Sometimes, the evolution of these patterns reminds me of semantic drift in languages.

A lot of constructs are poorly understood and misused. I'm not blaming people, though; part of the problem is that I can't easily point to a single, modern reference work that someone should read before writing shell scripts. And, since shell scripts often feel like "configuration" rather than "programming", I imagine people don't even think about learning shell as a programming language.

Writing a shell helps disabuse people of some common confusions, for example that:

bash is shell scripting, definitively, and if you write a shell script, it is a "bash script" ("all the world's a VAX" and all that);
quotes make something into a string;
double and single quotes are interchangeable;
the argument to if, while, et cetera is something magical;
writing export FOO=x repeatedly does something;
variable case has some magic properties;
et cetera.

(Don't forget to use shellcheck and checkbashisms everywhere!)

4. Why write a shell

In the workshop, I cite the following motivations for writing a shell:

to give you a better understanding of how Unix processes work;
- this will make you better at designing and understanding software that runs on Unix;
to clarify some common misunderstandings of POSIX shells;
- this will make you more effective at using and scripting ubiquitous shells like bash;
to help you build a working implementation of a shell you can be excited about working on.
- there are endless personal customizations you can make to your own shell, and making them can help you think about how you interact with your computer and how that might be different.

I've already touched on the first two, but the third is maybe less obvious. The shell remains a ubiquitous interface, decades after we imagined other modes of interaction would replace it. The field is ripe with opportunities for improvements.

There are a lot of people exploring this space in interesting ways, but I think there's room for so much more.

A lot of existing tutorials focus on the non-interactive case, and I think people will have more fun if they build a shell they can use interactively.

Aside from the interactive case, a lot of infrastructure is held together with shell scripts.

There's a commonly held belief that scripting languages like Perl, Ruby, and Python are complete replacements for shell scripting. My own experience is that these languages lack the expressive tools of the shell for working with pipelines, exit statuses, redirections, and so on, and the replacement code is often:

sequential rather than parallel, and often much slower for this reason;
full of deadlocks, race conditions, and signal handling issues;
much more verbose and less clear than an equivalently carefully-written shell script.

So, I feel there's still room for new tools in this space, too.

5. Conclusion

If you've been thinking about writing a shell for a while and haven't gotten around to it, why not try my workshop or any of the tutorials it links to?

Footnotes:

POSIX avoids dictating exactly what must be a builtin, but does specify that the following commands must be executed no matter if they are in the path:

alias bg cd command false fc fg getopts jobs kill newgrp pwd
read true umask unalias wait

Most of these have something to do with the shell's internal state, but not all.

This is a kind of chain loading, sometimes called Bernstein chaining. There's a lovely discussion of this in Andy Chu's Shell has a Forth-like quality. (The entire oil shell blog is full of great stuff.)

Return to the Source

2017-10-05T02:30:00+0000

If a system is to serve the creative spirit, it must be entirely comprehensible to a single individual. — Dan Ingalls

I saw Ellen Ullman speak last night, about her new book,¹ and the topic turned to culpability for Y2K, systems that people never expected would run for decades, and systems that no one understands any more.

When a Peterborough nuclear facility reached out to retrocomputing enthusiasts looking for someone who knew PDP-11 assembler, I started thinking about the Foundation series (warning: possible spoilers follow). The idea that was most striking to me, in those books, was that eventually, societies who became comfortable with advanced technology could end up losing the knowledge of how that technology worked (cast in a very '50s nuclear vibe).² I encounter a lot of people dismissive of the importance of systems programming (more so online than IRL, thankfully), and it makes me wonder if we are rapidly heading in that direction.

Ullman talked about "returning to the source" — extracting the lost knowledge from code whose authors aren't around anymore. There couldn't have been a more serendipitous time for this, as I had just been discussing the merits (and pitfalls) of reading source with my fellow Recursers.

It's my sincere belief that code is the source of truth in computing (and by this, I also mean machine code, which is also worth reading; the success of Matt Godbolt's Compiler Explorer tells me I'm not alone in this). So I am writing this article to exhort you to read code written by someone you don't know, today, to save the future.

1. Why Read? Craftsmanship

Jon Bentley opens his first Programming Pearl on literate programming with the following:

When was the last time you spent a pleasant evening in a comfortable chair, reading a good program? I don't mean the slick subroutine you wrote last summer, nor even the big system you have to modify next week. I'm talking about cuddling up with a classic, and starting to read on page one. Sure, you may spend more time studying this elegant routine or worrying about that questionable decision, and everybody skims over a few parts they find boring. But let's get back to the question: When was the last time you read an excellent program?

(I like to ask this question in interviews, on both sides of the table; not to be a snob, but to open up a discussion about reading code.)

I always remember this better the way Steve McConnell paraphrases Jon Bentley in Code Complete:

One especially good way to learn about programming is to study the work of great programmers. Jon Bentley thinks that you should be able to sit down with a glass of brandy and a good cigar and read a program the way you would a good novel.

The intent is clear: you can improve as a craftsperson by reading masterpieces of software, the same as writers need to read other works³ and musicians need to listen to other performances. Empirically verifying whether this is true is unfortunately outside my abilities, but I believe I've benefitted greatly from "reading the greats".⁴

One thing to clarify from those quotes, though: these always made me picture reading the source from top to bottom, and it turns out this isn't particularly effective.

I've been passing Peter Seibel's Code is not Literature around a lot lately; this is an article I didn't understand when I first read it. I thought it was an attack on code reading, but in fact it's a suggestion of much better ways to approach reading code, especially in a group.

There's an extension to that: I think Pierre Bayard's How to Talk About Books You Haven't Read expresses this far better than I can, but I think it's self-defeating to believe that "having read the code" is a binary state: either you read (and understood) it all, or you haven't "read it". Diving into a codebase is just the start of a long relationship with that code; you can keep coming back, to familiar haunts and undiscovered territories every time.

2. Why Read? Personal mastery

I started this article with my favorite Dan Ingalls quote, from the design principles of Smalltalk. I think there's a deep truth about software in that. We don't seem to be able to build abstractions that aren't leaky, so you're always going to need to be able to go up and down in the layers of abstraction in a system just to fix the problems at the level you care about.

What will a system that lasts 10000 years look like? I don't think it will be one that no one can understand. Is it possible to build complete systems that can be understood by an individual? The work by Viewpoints Research Institute seems to suggest it's possible. Until the day when we all have our personal 10kLOC operating systems⁵ committed to memory, Fahrenheit 451-style, reading systems large and small helps one grapple with the nature of complexity, and find a personal relationship with it.

And, pragmatically, reading your dependencies helps you answer the question: will anyone be able to understand this when it breaks? (And it will break: because of bitrot; because the assumptions changed.)

3. Why Read? Procedural rhetoric

Ellen Ullman also talked about how algorithms have biases; software isn't neutral⁶. This reminded me of Ian Bogost's concept of procedural rhetoric, where interaction with systems can be persuasive, and can communicate ideas and opinions, in a way that is subtle.

This is a deep topic in its own right, but I think the first step in being the future masters of technology⁷ is to understand the workings of systems we interact with, and the purest form of that is reading their code. Even when we can't read the code of many systems around us, reading the code of similar systems is a part of understanding how they might work and the biases implicit within them.

4. What's worth reading?

You might be wondering, "where do I even start?". I think there are two classes of code especially worth reading: code that you use, and code that is great. The latter is incredibly subjective; I have been compiling a list of what I think are "masterpieces of software" and will post it at some point, to much criticism, I'm sure.⁸

However, the former is straightforward for everyone: read your dependencies. Now that we live in a Free Software utopia (hah)⁹, you probably have access to the source of the vast majority of libraries, servers, tools, and systems you use and depend on. This is a wealth that is often squandered.

5. Literate programming

Fans of literate programming are probably champing at the bit, waiting for me to unveil Knuth's perfect plan for programmer literacy. (If you're not familiar with literate programming at all, I think Knuth's book is still the best treatment, even if it is pretty dated at this point.)

I have done a bit of literate programming (and I still think literate assembly language is the most useful application of these techniques); I've read a lot of literate programs; and as it relates to the topic of this article, I feel it's mostly irrelevant. There will always be the need to read unadorned, unpresented programs. Literate programs are lovely, but they aren't a complete replacement for the kind of code reading I'm advocating here (even if it was a common enough practice that one could find a reasonable supply of them).

6. Reading about reading

Sadly, there haven't been a lot of books about reading code. There are many books that intersperse commentary with code, but these aren't really about reading code; they're more like literate programs.

The only book I know to exclusively treat this subject is Diomidis Spinellis's Code Reading. It's been a while since I read it, but I remember feeling that it was a good start, but not the complete picture. The author has also compiled a lot of arguments for why code reading is important on that site, if you find this article unconvincing.

Michael Feathers's Working Effectively with Legacy Code is a truly great book, but I don't remember it having much concrete advice about actually reading legacy code. (I might be misremembering, of course.) However, it is about testing, and one of the great ways to read a codebase is to try to write tests for it.

7. Bonus: the tension of comments

I mentioned that I feel that code is the only truth of the system. (Which is a terrible oversimplification as almost all systems are also data-driven to some extent.) So it's unsurprising that I agree with Kernighan and Pike's advice on commenting in The Practice of Programming, which is often misinterpreted as "don't write comments".

When I read code (especially when I review code), I actually skip the comments (sometimes I strip them out) on my first pass through the code. I find that, because I'm much faster at reading English text than source code, it is easy for the eye to get comfortable reading the comments and only skimming the code. This deceives me into thinking I've actually read the code, when I haven't.

However, I have an egregious example from Darwin (macOS) osfmk/kern/thread_call.c:

if (cancel_all)
        result = _remove_from_pending_queue(func, param, cancel_all) |
                _remove_from_delayed_queue(func, param, cancel_all);
else
        result = _remove_from_pending_queue(func, param, cancel_all) ||
                _remove_from_delayed_queue(func, param, cancel_all);

I often trot this snippet out when I ask people where their threshold for "too clever" code is. My first reaction on seeing this was "this is a typo"; only after looking it over again carefully did I realize what it was trying to do, and then it took a while longer to think through whether it actually did that correctly.

But the most recent time I went to show someone this snippet, it turned out it had been updated, including adding some crucial comments!

if (cancel_all) {
        /* exhaustively search every queue, and return true if any search found something */
        result = _cancel_func_from_queue(func, param, group, cancel_all, &group->pending_queue) |
                 _cancel_func_from_queue(func, param, group, cancel_all, &group->delayed_queues[TCF_ABSOLUTE])  |
                 _cancel_func_from_queue(func, param, group, cancel_all, &group->delayed_queues[TCF_CONTINUOUS]);
} else {
        /* early-exit as soon as we find something, don't search other queues */
        result = _cancel_func_from_queue(func, param, group, cancel_all, &group->pending_queue) ||
                 _cancel_func_from_queue(func, param, group, cancel_all, &group->delayed_queues[TCF_ABSOLUTE]) ||
                 _cancel_func_from_queue(func, param, group, cancel_all, &group->delayed_queues[TCF_CONTINUOUS]);
}

Reading this code gave me an appreciation for a kind of comment I would otherwise have tended to omit.
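If the difference between the two branches isn't obvious, this small sketch isolates it. The search function is a hypothetical stand-in for the queue searches above; it just counts how often it gets called.

```c
#include <stdbool.h>

static int calls;   /* how many "queues" were actually searched */

/* Hypothetical stand-in for a queue search: record the call, report the result. */
static bool search(bool found) {
    calls++;
    return found;
}

/* Bitwise |: every operand is evaluated (in unspecified order),
 * so every queue is searched even after a hit. */
bool search_all(void) {
    return search(true) | search(true) | search(false);
}

/* Logical ||: evaluation goes left to right and stops at the first true operand. */
bool search_first(void) {
    return search(true) || search(true) || search(false);
}
```

Both functions return true, but search_all leaves calls at 3 while search_first leaves it at 1, which is exactly the side effect the Darwin code is relying on.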

8. Bonus: How to read a C program

I would be remiss not to end this with some concrete advice. Reading tips will vary by language, but a lot of code I read is written in C; how do I approach reading a C program?

When in doubt, start from the bottom. Occasionally someone tries to fight the natural C order of definitions by forward-declaring static functions; this is unnatural and most code isn't written this way. Instead, you'll generally see that if you want a "top-down" view, you should go to the end and work backwards. (Incidentally, this is even more true for OCaml / Standard ML programs, where the order of declaration is very strict.)

Use unifdef to get rid of as much that is irrelevant to you as possible, at least for an initial reading. Get rid of those paths that are only taken on Acorns and Ataris.

Use ctags, cscope, GNU global, and whatever other support you can find to be able to quickly jump to and from the definitions of identifiers, ideally also seeing all the places that refer to those identifiers. Cross-reference tools like LXR (on the Linux kernel, on MRI) are sometimes nice for this, although I often find them more cumbersome than using my editor on my local machine.

Look at the header files included; what are the data structures that get used all the time? Sometimes there's tangly stuff with macros, like the queue.h macros for intrusive data structures; it can help to run the preprocessor over the file (cc -E) or write the structure out by hand on paper and annotate it.
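As a small illustration of writing the structure out, here is a sketch using the SLIST macros from sys/queue.h, with comments showing roughly what each macro expands to. The struct job type and sum_ids are made up for the example, and the exact expansion can differ slightly between systems.

```c
#include <stddef.h>
#include <sys/queue.h>

struct job {
    int id;
    SLIST_ENTRY(job) link;      /* roughly: struct { struct job *sle_next; } link; */
};

SLIST_HEAD(job_list, job);      /* roughly: struct job_list { struct job *slh_first; }; */

/* Walk the intrusive list and add up the ids. */
int sum_ids(struct job_list *head) {
    int total = 0;
    struct job *j;
    SLIST_FOREACH(j, head, link)    /* a for loop: j = first; j != NULL; j = j->link.sle_next */
        total += j->id;
    return total;
}
```

The point of the annotations is that, once expanded, the "list" is nothing more than a next pointer embedded in each element, which is what makes it intrusive.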

Don't be afraid to mutilate the program to understand it. Cut things out and try to compile it. Make hypotheses and validate them. Is this struct field used by anything? Let's cut it out and see what breaks in the compile. (Newer statically-typed languages tend to be even more receptive to those kinds of experiments.) Attach a debugger and set breakpoints at functions you're reading.

If you're reading a library, consider starting with an example program, tracing through the API calls made, into the guts of the library.

Maybe you also have the version control history (it's wonderful that we can start almost taking this for granted). When you find something interesting, dig back with blame; what changed, and why? Also, if the code seems too complicated, it can be helpful to start from an early revision and then work forward in history. Seeing the code adapt as imagination encounters the real world paints a picture of evolution as vivid as any archaeological exhibit.

Serendipity can also be good. Seibel, cited above, describes "play[ing] the role of a 19th century naturalist returning from a trip to some exotic island to present to the local scientific society a discussion of the crazy beetles they found". Sometimes I like to just peek at different parts of the code at random, and see if there's something that catches my eye or delights me.

After all, reading code is not just good for you; it is fun.

Footnotes:

¹

I should probably wait until I've read her new book to write this, since I'm sure it has some great insights about this, but I can't wait.

²

Tangent: perhaps this will never happen in software because nothing runs reliably long enough for anyone to think we can get rid of the programmers.

³

"If you don't have time to read, you don't have the time (or the tools) to write. Simple as that." — Stephen King.

⁴

I feel this is advice commonly given to young mathematicians but I can't find any source for it, at the moment. I think it must derive from this Niels Abel quote.

⁵

Chuck Moore is ahead of the game on this one.

⁶

I feel this is inherent, because software is made of decisions.

⁷

"The future masters of technology will have to be light-hearted and intelligent. The machine easily masters the grim and the dumb." — Marshall McLuhan

⁸

Ok, a friend convinced me to include a few places to start if you're really at a loss; since I talk about C at the end, how about some C code I've enjoyed reading recently: postgres, anything by cperciva, Illumos, sqlite. Some of these are pretty complicated, but typically stylistically good, and uncommonly well-commented.

⁹

Desperately absent in this article is an acknowledgement of how much free and open source software has changed the world, but I don't know how to write about it. The beginning of David MacIver's Programmer at Large makes me think, though.

Are Jump Tables Always Fastest?

2017-10-03T02:30:00+0000

tl;dr: I make a petty point about premature optimization; don't go out and rewrite your switch statements as binary searches by hand; maybe do rewrite your jump tables as switch statements, though.

A couple of years ago I got into an argument in a job interview. In this case, the question was how I would implement dispatch for a protocol handler efficiently, and my answer was that I would write the most obvious code possible, probably with a switch statement, and see what my compiler produced, before making any tricky implementation choices. (Of course, my answer should have been, "write the obvious thing and then measure it", but I have always had a thing for inspecting compiler output.)

The interviewer indicated that this wasn't the answer they wanted to hear, and kept prodding me until I realized the question they thought they were asking was "how do you implement a jump table in C". At the time, I grumbled a bit that a jump table wasn't necessarily the best approach, especially if the distribution of dispatch cases is uneven, but I didn't fight too much. This stuck with me, though, for a few reasons:

any reasonable compiler will emit a jump table for a dense switch statement if it judges prudent;
given the complexities introduced by the cache and branch prediction, it's not a safe assumption that comparison-based dispatch is slower than a jump table;
any time you feel wronged in an interview you'll never let it go.

The more I thought about it, the more I wondered how big these effects might be. I started to work on a simple experiment to evaluate the performance of table-based dispatch versus comparison-based dispatch, and reviewed the literature. (Spoiler: if you're looking for a rigorous experimental evaluation of these effects, it's not in this article. Read Roger Sayle's A Superoptimizer Analysis of Multiway Branch Code Generation for that.)

I got a push to finish this when I attended ILC2014, where Robert Strandh presented a paper on improving generic dispatch in CLOS which relied on the idea that table-based dispatch methods pay a penalty on modern hardware due to the additional, non-sequential memory accesses.

However, I still didn't do this, and the code sat for a long time.

But it came up again, I had some more literature references, and everyone loves a juicy interview story, so let's uncharitably phrase the interviewer's hypothesis as "jump tables are always faster" and show that this isn't so.

1. a little background

What do I mean by jump tables and so on? Imagine we have code like this:

switch (packet[9]) {
  case PROTO_TCP: return handle_tcp(packet);
  case PROTO_UDP: return handle_udp(packet);
  case PROTO_ICMP: return handle_icmp(packet);
  ...
}

Let's abstract that a little. Our experiment will actually generate C code like this:

void dispatch(unsigned state) {
  switch (state) {
    case 0: fn_0(); return;
    case 1: fn_1(); return;
    case 2: fn_2(); return;
    case 3: fn_3(); return;
    default: abort();
  }
}

The code the interviewer was looking for me to manually transform that into is approximately this:

static void (*vtable[4])(void) = { fn_0, fn_1, fn_2, fn_3 };

void dispatch(unsigned state) {
  if (state >= 4) abort();
  (*vtable[state])();
}

That's a jump table. We'll write it in x86-64 assembly like this:

        .text
dispatch:
        cmp $4, %edi
        jae 1f
        jmp *vtable(, %edi, 8)
1:      call abort
        .data
        .align 16
vtable:
        .quad fn_0, fn_1, fn_2, fn_3

How else could the compiler compile that switch statement? A really simple (but not always ineffective) way is linear search:

dispatch:
        cmp $0, %edi
        je fn_0
        cmp $1, %edi
        je fn_1
        cmp $2, %edi
        je fn_2
        cmp $3, %edi
        je fn_3
        call abort

But if we have a lot of cases, or they're widely spread out, we'd probably want to at least use binary search:

dispatch:
        cmp $2, %edi
        jae .L1
.L0:    cmp $0, %edi            # state < 2, so it is 0 or 1
        je fn_0
        jmp fn_1
.L1:    je fn_2                 # flags still hold the comparison with 2
        cmp $3, %edi
        je fn_3
        call abort              # state > 3

Those are our basic options, although when we look at the literature, we'll see there are a range of other choices.

What is this good for in general? We find the pattern of "dispatch to a handler", often implemented with switch or pattern matching, all over the place: in finite state machines (check out this FSA-based packet filter), generic dispatch, protocol handlers, simple interpreters, and virtual machines. (Both Linux and FreeBSD use table-based dispatch instead of switch for handling IP packets, although I'd argue this is more about flexibility than speed.)

For example, the canonical implementations of Tcl (generic/tclExecute.c:2417) and Lua (lvm.c:793), as well as the most portable forms of Python and Ruby's bytecode interpreters, use switch-based dispatch.

Threaded code is more popular for interpreters these days; see Ertl and Gregg, The Structure and Performance of Efficient Interpreters, as well as this interesting comment about the efficiency of threaded code versus what the compiler generates for switch in CPython's Python/ceval.c:825.

Threaded code is basically where the end of each handler dispatches to the next one; this makes sense in a VM where you have the whole program, but not in a protocol handler where you probably don't have the next packet.
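A toy sketch of the difference, for a made-up three-instruction VM. The threaded version uses the "labels as values" extension (computed goto), which is GNU C rather than standard C and is supported by GCC and clang. The opcodes and function names are invented for illustration, and the threaded version does no bounds checking on opcodes.

```c
enum { OP_INC, OP_DOUBLE, OP_HALT };

/* Switch dispatch: every instruction goes through the single
 * dispatch point at the top of the loop. */
int run_switch(const unsigned char *pc) {
    int acc = 0;
    for (;;) {
        switch (*pc++) {
        case OP_INC:    acc += 1; break;
        case OP_DOUBLE: acc *= 2; break;
        case OP_HALT:   return acc;
        default:        return -1;
        }
    }
}

/* Threaded dispatch: each handler ends by jumping directly to the
 * handler for the next instruction, so there are as many dispatch
 * points as there are handlers. */
int run_threaded(const unsigned char *pc) {
    static void *const handlers[] = { &&op_inc, &&op_double, &&op_halt };
    int acc = 0;
#define NEXT() goto *handlers[*pc++]
    NEXT();
op_inc:    acc += 1; NEXT();
op_double: acc *= 2; NEXT();
op_halt:   return acc;
#undef NEXT
}
```

Both functions compute the same result for the same program; the only difference is where the indirect jumps live.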

Threaded code should have better branch prediction behavior than a jump table with a single dispatch point (for a nice analysis of this, see Eli Bendersky's Computed goto for efficient dispatch tables), although indirect branch prediction should help even the field. (Update: See Branch Prediction and the Performance of Interpreters - Don’t Trust Folklore; hardware has already caught up.)

But we're getting ahead of ourselves. Can we find any support in the literature for our argument? (Ok, we could look at the GCC source, but I promise, the literature is a better place to start in this case.)

2. the early literature

Looking around, we quickly trace back to Arthur Sale's The Implementation of Case Statements in Pascal (1981), which is interesting just for the details on how simplistic many compilers of the time were, often because they had to emit code immediately.

Sale points out that most Pascal compilers at the time would produce simple jump tables from case statements, and proposes binary search instead.

Robert Bernstein (Producing good code for the case statement (1985); paywalled, sorry) goes into some gritty details; he points out that binary search usually takes one more comparison than linear search per case, and asserts that binary search is faster than linear search when there are at least 4 case items. We'll revisit that further on.

Bernstein talks about some of the latitude we have in optimizing these search trees; for example, that paths which lead to traps can have the most instructions, since they're not likely to be repeatedly executed, and we can decide whether to put the jump-above / jump-equals first.

3. a dialogue with the compiler

How do we decide some of these things? Bernstein says:

Faster executing code can be produced if the probabilities that the case selector takes on the case item values are known. These may be known as a result of trace information that is automatically supplied to the compiler, or perhaps as an extra-lingual mechanism pertaining to the case statement. In step 5, the linear search can be arranged in decreasing probabilities, and a Huffman search rather than a binary search can be used.

How can you supply this information to the compiler? I found Dan (djb) Bernstein's The Death of Optimizing Compilers made concrete everything I had been thinking for a while about the need for a dialogue between the programmer and the compiler.

Right now, we can supply a bit of information ahead of time; for example, in GCC, we can use __builtin_expect (which you might be familiar with from the Linux kernel's likely/unlikely macros). LLVM has branch weight metadata, although it looks like clang's interface to this is only the same limited __builtin_expect interface as GCC.
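For reference, this is roughly what that looks like in practice. The macros follow the shape of the kernel's, and classify is a hypothetical dispatcher; the hints only tell the compiler which way to lay out the branches, they don't change the result.

```c
/* !! normalizes any truthy value to 1 so it can be compared with the expected value. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* IP protocol numbers. */
enum { PROTO_ICMP = 1, PROTO_TCP = 6, PROTO_UDP = 17 };

/* Hypothetical dispatcher: returns a small handler index, or -1 if unknown.
 * We claim that TCP is the common case and ICMP the rare one. */
int classify(unsigned char proto) {
    if (likely(proto == PROTO_TCP))    return 0;
    if (unlikely(proto == PROTO_ICMP)) return 2;
    if (proto == PROTO_UDP)            return 1;
    return -1;
}
```

Note how coarse this is: we can say "probably" or "probably not" about one condition at a time, but not "70% of the time".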

We can do much better with profile-guided optimization (PGO) / feedback-driven optimization (FDO); see AutoFDO in GCC, -fprofile-arcs, -fbranch-probabilities in GCC, et cetera, but these (like the benefits of JIT compilers) require us to run the program with representative workloads. This might be better for some cases, but is a bit unsatisfying if we want to communicate in the source code (to both reader and compiler), something we know ahead of time about the distribution of cases.

(Outside the scope of this article, but interesting: branch prediction tables are power hungry; we could want to optimize a specific switch for power instead of performance. This is another application of this kind of dialogue with the compiler.)

4. later literature

Through the 90s, there's a trickle of papers: Kannan and Proebsting issue an important practical correction to Bernstein; H.G. Dietz's Coding Multiway Branches Using Customized Hash functions (1992) proposes a hashing approach to the problem; Erlingsson et al.'s Efficient Multiway Radix Search Trees (1996) presents a better approach for sparse sets.

Already it's becoming clear that switch generation isn't cut and dried, and then through the 2000s Roger Sayle publishes several papers, culminating in A Superoptimizer Analysis of Multiway Branch Code Generation, which almost saves us from having to run any experiment at all. It contains citations to lots of related work (much of it by Sayle himself), but more importantly, a great summary of techniques, and benchmarks that pretty much prove our point for us already.

Sayle demonstrates that GCC can produce a wide range of code for switch, suitable to many different situations, and even suggests that the compiler should detect attempts to manually implement switch (usually motivated by performance considerations that were relevant only on legacy systems) and undo them.¹

5. an experiment

However, as I am a big believer in rhetorical benchmarks, we might as well construct a small experiment that definitively proves our (admittedly ungracious and cherry-picked) point.

I wrote some code for this a few years ago, and then my ambitions for what should be tested grew beyond measure, leading to nothing getting done. So, in restoring this, I decided to explicitly allow myself to release some terrible code full of measurement errors: after all, this is a rhetorical benchmark (and what benchmark isn't?) — it doesn't need to be correct, it only needs to prove our point.²

The important thing for making our point is that we can find branch probabilities that support almost any implementation choice. For example, an IP stack protocol handler might encounter TCP packets 70% of the time, UDP packets 25% of the time, and other stuff 5% of the time. (Note: not an actual measurement.) In this case, if we can bias our dispatch code towards TCP and UDP packets, we're likely to get a win.

In this case, we'll choose a distribution like this (but even more unfair), where 80% of the time, we take the first case, and 20% of the time we take a random case. So we run: (on a machine quite unsuitable to benchmarking)

make run-experiment CALL_DISTRIBUTION=pareto N_DISPATCHES=5000000 N_RUNS=5 N_ENTRIES=256

and cherry-pick some spurious but convincing results:

Performance counter stats for './x86_64-binary 5000000' (5 runs):

    6,883,819,114      cycles                    #    2.090 GHz                      ( +-  0.43% )
      232,004,486      instructions              #    0.03  insns per cycle          ( +-  0.06% )
       56,828,213      branches                  #   17.257 M/sec                    ( +-  0.04% )
        1,262,892      branch-misses             #    2.22% of all branches          ( +-  0.05% )

      3.299025345 seconds time elapsed                                          ( +-  0.43% )

Performance counter stats for './x86_64-vtable 5000000' (5 runs):

    7,709,225,443      cycles                    #    2.087 GHz                      ( +-  0.95% )
      217,283,422      instructions              #    0.03  insns per cycle          ( +-  0.03% )
       51,631,368      branches                  #   13.976 M/sec                    ( +-  0.03% )
          957,553      branch-misses             #    1.85% of all branches          ( +-  0.10% )

      3.706410106 seconds time elapsed                                          ( +-  1.04% )

Which allows us to claim what came to us in l'esprit de l'escalier after that ill-fated interview: explicitly-constructed jump tables aren't always faster than what the compiler generates.

(The code is at https://github.com/tokenrove/dispatch-comparison, but I wouldn't recommend using it for anything.)

6. further work

Ideally, we'd want to run a proper experiment on this, that tries all sorts of combinations, with proper cache-flushing³, variable amounts of work in the "handlers", et cetera. Maybe later. I'm explicitly calling this experiment cheating and moving on.

I didn't talk about the effects of these techniques on instruction cache usage, which is actually one of the most interesting factors. Unfortunately there are a lot of different patterns to talk about: when the handlers are large enough to boot the dispatch code out of I-cache, versus tight loops that are just dispatching all the time.

There are also a lot of ways we can tweak search:

Knuth breaks down the math for cost for linear search with carefully ordered data (by probability); see TAOCP volume 3, section 6.1.
It's good to know at what point binary search becomes faster than linear search on your machine, as this experiment demonstrates (and then Paul Khuong argues that binary search is basically always better).
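As a small worked example of the ordered linear search cost: if the key tried in position i (counting from 1) is the target with probability p_i, a successful search costs the sum of i·p_i comparisons on average. The function name below is made up for this sketch, and it assumes every search succeeds.

```c
#include <assert.h>
#include <stddef.h>

/* Expected number of comparisons for a successful linear search, where
 * p[i] is the probability that the key tried at 0-based position i is
 * the one we want. */
double expected_comparisons(const double *p, size_t n)
{
    double cost = 0.0;

    for (size_t i = 0; i < n; i++) {
        assert(p[i] >= 0.0);
        cost += (double)(i + 1) * p[i];   /* position i costs i+1 comparisons */
    }
    return cost;
}
```

With the made-up 70% / 25% / 5% split from earlier, trying the cases from most to least likely costs 1.35 comparisons on average, while the reverse order costs 2.65.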

Leveraging some of what we've seen so far, could we better exploit branch prediction by combining optimal decision tree dispatch with threaded code? Suppose the end of each VM instruction were rewritten to perform the first $n$ levels of binary search, so the highest level branches would be better predicted (if instruction dispatch is close to a Markov process, anyway).⁴ See also Baer, On Conditional Branches in Optimal Search Trees.

But I think I'll avoid exploring this further until the next querulous job interview.⁵

Footnotes:

¹ Tangent: if we see switch as just an anemic form of pattern matching, there's a lot more of interest in the literature, but that's another rabbithole.

² For example, I'm totally ignoring the advice from Kalibera, Jones: Rigorous Benchmarking in Reasonable Time. If luck prevails, I'll write a bit about common easily-made benchmarking errors and the role of rhetorical benchmarks soon.

³ See Whaley and Castaldo, Achieving accurate and context-sensitive timing for code optimization.

⁴ NB: a binary search sorted by weights is precisely a Huffman search.

⁵ Many people helped me make this article better. Thanks! I'm not sure how to thank you all. Of course the mistakes remain my own.

When is an Erlang iolist an iovec?

2017-01-15T03:30:00+0000

While trying to improve the performance of JSON encoding in an Erlang application last year, I came to wonder about the different representations one can use when writing data to disk or sending it over a socket, and how they map to the OS's underlying facilities.

In Erlang, there is a convention of deferring the creation of large binaries, whereby functions accept a list called an iolist, composed of binaries, characters (integers), and other, nested, iolists. This means that you don't need to waste time and space copying a bunch of pieces of a message into a single buffer before writing it out to a socket, for example.

If you've done much network programming, that concept probably sounds familiar; it's like a less-restrictive version of the struct iovec scatter-gather structures used in many Unix IO calls (see the glibc documentation for example).
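For anyone who hasn't used it, here is a minimal sketch of a gather write with POSIX writev(). The function name and the three message fragments are invented for illustration; the point is only that separate buffers go out in one system call without first being copied into one big buffer.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Write three separate buffers to fd with a single writev() call.
 * Returns the number of bytes written, or -1 on error. */
ssize_t write_greeting(int fd)
{
    /* writev() does not modify the buffers, but iov_base is a plain
     * void *, so writable arrays avoid casting away const. */
    char head[] = "{\"msg\":\"";
    char body[] = "hello";
    char tail[] = "\"}\n";

    struct iovec iov[3] = {
        { .iov_base = head, .iov_len = strlen(head) },
        { .iov_base = body, .iov_len = strlen(body) },
        { .iov_base = tail, .iov_len = strlen(tail) },
    };

    assert(fd >= 0);
    return writev(fd, iov, 3);
}
```

Calling write_greeting(STDOUT_FILENO) would emit the three fragments as one 16-byte line. A robust caller would also handle short writes, which this sketch ignores.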

Since there's an obvious connection, I asked myself the titular question for this post: when does an Erlang iolist map most closely to an iovec? Is there a layout that minimizes copying and allocation?

tl;dr: Erlang does a pretty good job of this, so don't worry about it and just use iolists. A list of large refcounted binaries maps closely to an iovec with minimal extra copying. Also, some drivers don't even support iovecs!

1. Establishing the relationship

I had previously noticed the SysIOVec structure in the ERTS source, which erts/emulator/sys/unix/driver_int.h defines as:

typedef struct iovec SysIOVec;

(As I was fact-checking this article, I noticed that a chapter of the ERTS reference manual is explicit in making this connection. People often complain about the Erlang documentation, but part of the problem is just that you might never think to read a chapter called "How to implement an alternative carrier for the Erlang distribution".)

ErlIOVec, which contains a SysIOVec, is defined like this:

typedef struct erl_io_vec {
    int vsize;                  /* length of vectors */
    ErlDrvSizeT size;           /* total size in bytes */
    SysIOVec* iov;
    ErlDrvBinary** binv;
} ErlIOVec;

There's an erlang-questions mailing list post by Scott Fritchie that points to the functions io_list_to_vec() and io_list_vec_len(). Let's take a look at them.

2. Reading the source

Here's the beginning of io_list_vec_len: (in erts/emulator/beam/io.c)

/* 
 * Returns 0 if successful and a non-zero value otherwise.
 *
 * Return values through pointers:
 *    *vsize      - SysIOVec size needed for a writev
 *    *csize      - Number of bytes not in binary (in the common binary)
 *    *pvsize     - SysIOVec size needed if packing small binaries
 *    *pcsize     - Number of bytes in the common binary if packing
 *    *total_size - Total size of iolist in bytes
 */

static int 
io_list_vec_len(Eterm obj, int* vsize, Uint* csize,
                Uint* pvsize, Uint* pcsize,
                ErlDrvSizeT* total_size)
{

This is called by erts_port_output, and a common binary gets allocated based on the csize returned. In our thought experiment here, we want to avoid allocating this common binary, and we would like to make sure vsize is minimized and that we can efficiently pack the iovec without allocation and copying.

Note that we figure out both packed and unpacked sizes. Basically, if the unpacked vsize is small enough, we won't pack, since that means less work (as we'll see below).

It's worth noting here that copying things into a common binary, although it will cause an allocation, can be much faster than forcing the OS to work with a long iovec.

io_list_vec_len() continues:

    DECLARE_ESTACK(s);
    Eterm* objp;
    Uint v_size = 0;
    Uint c_size = 0;
    Uint b_size = 0;
    Uint in_clist = 0;
    Uint p_v_size = 0;
    Uint p_c_size = 0;
    Uint p_in_clist = 0;
    Uint total; /* Uint due to halfword emulator */

erts/emulator/beam/global.h says of DECLARE_ESTACK: "Here is an implementation of a lightweiht stack." (sic) It basically lays out a little array of terms, initially on the stack (up to 16 items), which can be migrated to the heap if the stack grows too much.
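The general idea can be sketched as follows. This is not the ERTS implementation: small_stack and its functions are hypothetical names, and the element type is a plain long rather than an Eterm. It only illustrates storage that starts inline (so on the C stack, if that's where the struct is declared) and moves to the heap once it outgrows 16 items.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SMALL_STACK_ITEMS 16   /* mirrors the 16-item figure quoted above */

/* Note: the struct points into itself, so it must not be copied by value. */
typedef struct {
    long   inline_items[SMALL_STACK_ITEMS]; /* lives wherever the struct lives */
    long  *items;                           /* inline_items or a heap block */
    size_t len, cap;
} small_stack;

void small_stack_init(small_stack *s)
{
    s->items = s->inline_items;
    s->len = 0;
    s->cap = SMALL_STACK_ITEMS;
}

int small_stack_on_heap(const small_stack *s)
{
    return s->items != s->inline_items;
}

/* Returns 0 on success, -1 if growing the stack failed. */
int small_stack_push(small_stack *s, long v)
{
    if (s->len == s->cap) {
        size_t new_cap = s->cap * 2;
        long *p;

        if (small_stack_on_heap(s)) {
            p = realloc(s->items, new_cap * sizeof *p);
            if (!p)
                return -1;
        } else {
            /* First overflow: migrate the inline items to the heap. */
            p = malloc(new_cap * sizeof *p);
            if (!p)
                return -1;
            memcpy(p, s->inline_items, s->len * sizeof *p);
        }
        s->items = p;
        s->cap = new_cap;
    }
    s->items[s->len++] = v;
    return 0;
}

long small_stack_pop(small_stack *s)
{
    assert(s->len > 0);
    return s->items[--s->len];
}

void small_stack_destroy(small_stack *s)
{
    if (small_stack_on_heap(s))
        free(s->items);
    small_stack_init(s);
}
```

As long as the iolist is shallow, the heap allocator is never touched.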

(Reading the ERTS allocator code is an easy way to spend a lot of time; it's pretty convoluted, and figuring out exactly which allocator ends up doing what is non-obvious on one's first few encounters with that code.)

    goto L_jump_start;  /* avoid a push */

    while (!ESTACK_ISEMPTY(s)) {
        obj = ESTACK_POP(s);
    L_jump_start:

We begin by jumping into the loop. Note throughout this code the careful use of goto to avoid unnecessary churn on the stack.

        if (is_list(obj)) {
        L_iter_list:
            objp = list_val(obj);
            obj = CAR(objp);

If obj is a cons cell, we inspect the head.

            if (is_byte(obj)) {
                c_size++;
                if (c_size == 0) {
                    goto L_overflow_error;
                }
                if (!in_clist) {
                    in_clist = 1;
                    v_size++;
                }
                p_c_size++;
                if (!p_in_clist) {
                    p_in_clist = 1;
                    p_v_size++;
                }
            }

If it's a byte (specifically, a fixnum under 256), we add to the size of the common binary required. If we weren't already in a section that will point into the common binary, we have to add to the size of the underlying iovec. Note the behavior is the same for packed and unpacked.

So our first rule is probably going to be "avoid interspersing binaries and strings". This will keep vsize lower.

            else if (is_binary(obj)) {
                IO_LIST_VEC_COUNT(obj);
            }

Back in our loop, still handling the head of a list: if we got a binary instead, we invoke IO_LIST_VEC_COUNT.

This is a monster macro in the same file. Let's take a look in pieces:

#define IO_LIST_VEC_COUNT(obj)                                          \
do {                                                                    \
    Uint _size = binary_size(obj);                                      \
    Eterm _real;                                                        \
    ERTS_DECLARE_DUMMY(Uint _offset);                                   \
    int _bitoffs;                                                       \
    int _bitsize;                                                       \
    ERTS_GET_REAL_BIN(obj, _real, _offset, _bitoffs, _bitsize);         \
    if (_bitsize != 0) goto L_type_error;                               \

We start by getting a bunch of properties about the binary, and erroring out if this is a bitstring. In Erlang, a bitstring is a binary with a number of bits which isn't a multiple of 8. Note that we don't error out if this binary has a non-octet bit offset.

    if (thing_subtag(*binary_val(_real)) == REFC_BINARY_SUBTAG &&       \
        _bitoffs == 0) {                                                \
        b_size += _size;                                                \
        if (b_size < _size) goto L_overflow_error;                      \
        in_clist = 0;                                                   \
        v_size++;                                                       \

If this is a byte-aligned refcounted binary, we can put a reference to it right in the iovec. This is the main answer to this blog post's question.

        /* If iov_len is smaller then Uint we split the binary into*/   \
        /* multiple smaller (2GB) elements in the iolist.*/             \
        v_size += _size / MAX_SYSIOVEC_IOVLEN;                          \

The commit for this code, added in 2016, notes:

On windows the max size of an iov element is long, i.e. 4GB so in order to write larger binaries to file we split the binary into smaller 2GB chunks so that the write is possible.

I bet that was inspired by a fun bug.
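To illustrate what that extra count is reserving room for, here is a hedged sketch of splitting one large buffer into several iovec entries, each no longer than some maximum. The function name and the max_len parameter are assumptions for this example, standing in for MAX_SYSIOVEC_IOVLEN; this is not the code ERTS uses to do the split.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Describe `size` bytes starting at `base` using iovec entries of at most
 * `max_len` bytes each. Returns the number of entries filled in, which is
 * 0 when size is 0 and also when `iov` (room for `cap` entries) is too
 * small to hold them all. */
size_t split_into_iovecs(char *base, size_t size, size_t max_len,
                         struct iovec *iov, size_t cap)
{
    size_t n = 0;

    assert(max_len > 0);
    while (size > 0) {
        size_t chunk = size < max_len ? size : max_len;

        if (n == cap)
            return 0;
        iov[n].iov_base = base;
        iov[n].iov_len = chunk;
        n++;
        base += chunk;
        size -= chunk;
    }
    return n;
}
```

For a non-empty binary, the loop needs the ceiling of size / max_len entries, so counting 1 + size / max_len, as the macro does, is always enough (and one too many when size is an exact multiple).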

        if (_size >= ERL_SMALL_IO_BIN_LIMIT) {                          \
            p_in_clist = 0;                                             \
            p_v_size++;                                                 \
        } else {                                                        \
            p_c_size += _size;                                          \
            if (!p_in_clist) {                                          \
                p_in_clist = 1;                                         \
                p_v_size++;                                             \
            }                                                           \
        }                                                               \

This code applies only to our packed counts. ERL_SMALL_IO_BIN_LIMIT is 4*ERL_ONHEAP_BIN_LIMIT, which is 4×64 = 256. So if the binary is less than 256 bytes, and we need to pack, we'll copy it into the common binary instead of referencing it directly.

    } else {                                                            \
        c_size += _size;                                                \
        if (c_size < _size) goto L_overflow_error;                      \
        if (!in_clist) {                                                \
            in_clist = 1;                                               \
            v_size++;                                                   \
        }                                                               \
        p_c_size += _size;                                              \
        if (!p_in_clist) {                                              \
            p_in_clist = 1;                                             \
            p_v_size++;                                                 \
        }                                                               \
    }                                                                   \
} while (0)

Otherwise (a heap binary, or a refcounted binary that isn't byte-aligned), we always copy it into the common binary.
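To see how the four counters move together, here is a simplified model of the counting logic over a flat sequence of pieces. Everything here is an assumption made for illustration: struct piece, the piece kinds, and count_pieces() are invented names, nesting is flattened away, and bit offsets, overflow checks, and the 2GB splitting are ignored. SMALL_BIN_LIMIT is the 256 derived above.

```c
#include <assert.h>
#include <stddef.h>

#define SMALL_BIN_LIMIT 256   /* stands in for ERL_SMALL_IO_BIN_LIMIT */

enum piece_kind { PIECE_BYTE, PIECE_HEAP_BIN, PIECE_REFC_BIN };

struct piece {
    enum piece_kind kind;
    size_t size;              /* ignored for PIECE_BYTE, which is 1 byte */
};

struct counts {
    size_t v_size, c_size;      /* unpacked: iovec entries, common bytes */
    size_t p_v_size, p_c_size;  /* packed: same, if small binaries are copied */
};

struct counts count_pieces(const struct piece *ps, size_t n)
{
    struct counts c = {0, 0, 0, 0};
    int in_clist = 0, p_in_clist = 0;   /* inside a run of common bytes? */

    assert(ps != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        size_t size = ps[i].kind == PIECE_BYTE ? 1 : ps[i].size;

        if (ps[i].kind == PIECE_REFC_BIN) {
            /* Unpacked: always referenced directly, ending any run. */
            in_clist = 0;
            c.v_size++;
            if (size >= SMALL_BIN_LIMIT) {
                p_in_clist = 0;
                c.p_v_size++;
            } else {
                /* Packed: small enough to copy into the common binary. */
                c.p_c_size += size;
                if (!p_in_clist) { p_in_clist = 1; c.p_v_size++; }
            }
        } else {
            /* Bytes and heap binaries are copied in both cases. */
            c.c_size += size;
            if (!in_clist) { in_clist = 1; c.v_size++; }
            c.p_c_size += size;
            if (!p_in_clist) { p_in_clist = 1; c.p_v_size++; }
        }
    }
    return c;
}
```

For the sequence byte, 1000-byte refc binary, byte, 10-byte refc binary, 20-byte heap binary, this model gives v_size = 5 and c_size = 22 unpacked, versus p_v_size = 3 and p_c_size = 32 packed: the small refcounted binary gets absorbed into the run of common bytes around it.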

Back in io_list_vec_len():

            else if (is_list(obj)) {
                ESTACK_PUSH(s, CDR(objp));
                goto L_iter_list;   /* on head */
            }

If the head is itself a list, we push the tail of the current list on the stack to deal with later, and start iterating over the nested list.

            else if (!is_nil(obj)) {
                goto L_type_error;
            }

Finally, if there's anything else that isn't the empty list, that's an error.

            obj = CDR(objp);
            if (is_list(obj))
                goto L_iter_list;   /* on tail */
            else if (is_binary(obj)) {  /* binary tail is OK */
                IO_LIST_VEC_COUNT(obj);
            }
            else if (!is_nil(obj)) {
                goto L_type_error;
            }

Here we handle the tail of the list. It's interesting that an iolist doesn't need to be a proper list (you can have [List | <<"bin">>]).

So, our conclusions so far would seem to be that an Erlang iolist maps most closely to an iovec when it is an arbitrarily nested list of refcounted binaries – and if there are more than 16 of them, they must be at least 256 bytes long each.

Since the stack for processing these lists changes its allocation strategy around 16 items deep, in practice you probably don't want your iolists to be too deeply nested.

Let's look at io_list_to_vec:

static int
io_list_to_vec(Eterm obj,       /* io-list */
               SysIOVec* iov,   /* io vector */
               ErlDrvBinary** binv, /* binary reference vector */
               ErlDrvBinary* cbin, /* binary to store characters */
               ErlDrvSizeT bin_limit)   /* small binaries limit */

The comments here are pretty clear; obj is the input and iov is the output. Whether to pack or not is expressed by bin_limit. As we look at the rest of the code, we'll see how the other arguments are used. The function begins with these declarations:

{
    DECLARE_ESTACK(s);
    Eterm* objp;
    char *buf  = cbin->orig_bytes;
    Uint len = cbin->orig_size;
    Uint csize  = 0;
    int vlen   = 0;
    char* cptr = buf;

Note that we set up buf, len, and cptr according to cbin. Their use will become clear as we go on.

We continue with a jump into the body of our loop:

    goto L_jump_start;  /* avoid push */

    while (!ESTACK_ISEMPTY(s)) {
        obj = ESTACK_POP(s);
    L_jump_start:

We can see that the structure of this function mirrors that of io_list_vec_len().

        if (is_list(obj)) {
        L_iter_list:
            objp = list_val(obj);
            obj = CAR(objp);

If our term is a cons, we look at the car.

            if (is_byte(obj)) {
                if (len == 0)
                    goto L_overflow;
                *buf++ = unsigned_val(obj);
                csize++;
                len--;

If it's a byte, we store that byte in the buffer passed in as cbin.

            } else if (is_binary(obj)) {
                ESTACK_PUSH(s, CDR(objp));
                goto handle_binary;

If it's a binary, we push the tail of the list we're working on onto the stack so we pick up there later, and jump to handle_binary.

            } else if (is_list(obj)) {
                ESTACK_PUSH(s, CDR(objp));
                goto L_iter_list;    /* on head */
            } else if (!is_nil(obj)) {
                goto L_type_error;
            }

If it's a list, save our current position in the outer list to the stack, and start walking the inner list.

            obj = CDR(objp);
            if (is_list(obj))
                goto L_iter_list; /* on tail */
            else if (is_binary(obj)) {
                goto handle_binary;
            } else if (!is_nil(obj)) {
                goto L_type_error;
            }

How the cdr of the cons is handled should be unsurprising by now.

        } else if (is_binary(obj)) {
            Eterm real_bin;
            Uint offset;
            Eterm* bptr;
            ErlDrvSizeT size;
            int bitoffs;
            int bitsize;

        handle_binary:
            size = binary_size(obj);
            ERTS_GET_REAL_BIN(obj, real_bin, offset, bitoffs, bitsize);
            ASSERT(bitsize == 0);

The binary handling case isn't too different from IO_LIST_VEC_COUNT above. We die immediately if this is a bitstring.

            bptr = binary_val(real_bin);
            if (*bptr == HEADER_PROC_BIN) {
                ProcBin* pb = (ProcBin *) bptr;
                if (bitoffs != 0) {
                    if (len < size) {
                        goto L_overflow;
                    }
                    erts_copy_bits(pb->bytes+offset, bitoffs, 1,
                                   (byte *) buf, 0, 1, size*8);
                    csize += size;
                    buf += size;
                    len -= size;

It's interesting that the bit offset is handled here, even though extra bits aren't permitted. The git history associated with this block is not helpful.

                } else if (bin_limit && size < bin_limit) {
                    if (len < size) {
                        goto L_overflow;
                    }
                    sys_memcpy(buf, pb->bytes+offset, size);
                    csize += size;
                    buf += size;
                    len -= size;

If we're packing small binaries and this one is below the limit, copy it in.

                } else {
                    if (csize != 0) {
                        io_list_to_vec_set_vec(&iov, &binv, cbin,
                                               cptr, csize, &vlen);
                        cptr = buf;
                        csize = 0;
                    }
                    if (pb->flags) {
                        erts_emasculate_writable_binary(pb);
                    }
                    io_list_to_vec_set_vec(
                        &iov, &binv, Binary2ErlDrvBinary(pb->val),
                        pb->bytes+offset, size, &vlen);
                }

Otherwise, we make the direct translation to an iovec entry. If we need to emit a reference to part of the common binary, do that first.

Note the curiously-named erts_emasculate_writable_binary() which seems to shrinkwrap the binary (reallocate it to trim unused space).

            } else {
                ErlHeapBin* hb = (ErlHeapBin *) bptr;
                if (len < size) {
                    goto L_overflow;
                }
                copy_binary_to_buffer(buf, 0,
                                      ((byte *) hb->data)+offset, bitoffs,
                                      8*size);
                csize += size;
                buf += size;
                len -= size;
            }
        } else if (!is_nil(obj)) {
            goto L_type_error;
        }
    }

If this is a heap binary, it's much less interesting. We just append to the common binary.

    if (csize != 0) {
        io_list_to_vec_set_vec(&iov, &binv, cbin, cptr, csize, &vlen);
    }

After all that, we reference the tail of the common binary, if we have anything left in it.

Finally we come to the end of the function:

    DESTROY_ESTACK(s);
    return vlen;

 L_type_error:
    DESTROY_ESTACK(s);
    return -2;

 L_overflow:
    DESTROY_ESTACK(s);
    return -1;
}

Not much to see here, but we might as well include it as we've looked at every other part.
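Stripped of term decoding, the core of the function is "copy small things into a common buffer, and flush the pending run before emitting a direct reference". Here is a simplified model of that shape over a flat array. struct chunk, emit(), and chunks_to_iovec() are hypothetical names, the refcounted flag stands in for the ProcBin check, and bit offsets and the binv bookkeeping are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

struct chunk {
    char  *data;
    size_t size;
    int    refcounted;   /* nonzero: may be referenced in place */
};

/* Append one iovec entry; returns -1 if the vector is already full. */
static int emit(struct iovec *iov, int cap, int *vlen, char *base, size_t len)
{
    if (*vlen == cap)
        return -1;
    iov[*vlen].iov_base = base;
    iov[*vlen].iov_len = len;
    (*vlen)++;
    return 0;
}

/* Returns the number of iovec entries used, or -1 if the common buffer
 * or the iovec is too small. bin_limit == 0 means "don't pack". */
int chunks_to_iovec(const struct chunk *cs, size_t n,
                    char *common, size_t common_len, size_t bin_limit,
                    struct iovec *iov, int iov_cap)
{
    char  *buf = common;    /* next free byte in the common buffer */
    char  *cptr = common;   /* start of the pending, not yet emitted run */
    size_t csize = 0;       /* length of the pending run */
    int    vlen = 0;

    assert(common != NULL && iov != NULL);
    for (size_t i = 0; i < n; i++) {
        const struct chunk *c = &cs[i];
        int copy = !c->refcounted || (bin_limit && c->size < bin_limit);

        if (copy) {
            if (common_len < c->size)
                return -1;
            memcpy(buf, c->data, c->size);
            buf += c->size;
            csize += c->size;
            common_len -= c->size;
            continue;
        }
        /* Direct reference: flush the pending common run first. */
        if (csize != 0) {
            if (emit(iov, iov_cap, &vlen, cptr, csize) < 0)
                return -1;
            cptr = buf;
            csize = 0;
        }
        if (emit(iov, iov_cap, &vlen, c->data, c->size) < 0)
            return -1;
    }
    /* Reference whatever is left at the tail of the common buffer. */
    if (csize != 0 && emit(iov, iov_cap, &vlen, cptr, csize) < 0)
        return -1;
    return vlen;
}
```

In this model, a small copied piece between two large referenced ones costs an extra iovec entry pointing into the common buffer, which is why interleaving the two kinds inflates the vector.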

This isn't complete without noting how this pair of functions is called by functions like erts_port_output(). I won't get into the guts, but if you look in the same file, you'll see that an iovec of SMALL_WRITE_VEC (16) is set up on the stack, and if io_list_vec_len() reports a vsize larger than that, it has to allocate the space using the ERTS_ALC_T_TMP allocator.

Even if csize is zero, a binary gets allocated anyway (driver_alloc_binary(0) returns a valid allocation). That seems like a waste but I guess it makes subsequent logic simpler.

3. Confirming what we've found

When I decided to write a test to verify this, I had some problems. First, I thought writing to /dev/null with file:write/2 would be the easiest way to isolate the differences. After some puzzling initial results and investigation with strace, I discovered that file:write/2 always converts the iolist to a binary before sending it to the file driver! (I imagine this is to avoid copying long strings.)

Then I tried sending data to cat >/dev/null opened as a port. At least now strace confirmed that writev was being called, but everything was being packed… a quick trip into gdb revealed drv->outputv wasn't set — this driver doesn't think it supports iovecs!

Ok, so a quick grep in erts/drivers/ reveals a lot of drivers don't define outputv. The only thing I could confirm would definitely use iovecs all the way was the TCP driver. An informal test, with ncat -k -l >/dev/null on the other side, confirmed the differences, and showed that the case where the iovec structure is preserved rather than packed was three times slower than the packed cases.

So, although this is interesting, being careful about the iolist structure generated isn't likely to get me any big wins for JSON encoding.

The iolist-iovec correspondence is probably most useful when you need to send really large iolists, ones that might be too large to allocate in one place.

#1GAM February 2015: ZooKicker

2015-03-01T03:30:00+0000

One who makes no mistakes never makes anything.

It's Nuit blanche à Montréal, three in the morning, but I'm not out in the city, surrounded by revellers; I'm at home, hunched over an aging Thinkpad, asking myself, "Is this a game? Can I release this?". I tweak another detail, and blaze through the game's three stolen levels again, prolonging the inevitable.

It's #1GAM time again. How did I end up with another last-minute crunch after supposedly learning my lesson last month?

(I will be updating this post with links to binaries shortly; for now, you can build ZooKicker from source using my modified tsdl and extra libraries.)

ZooKicker Linux x86_64 Debian sid

How to play ZooKicker: move with the cursor keys; press space to kick a square in the direction you are facing. You can kick a square through another square, as long as it's unobstructed. The goal is to kick pairs of squares of the same color together. There are three levels.

1. Demon of the Fall

I spent most of the month working on reviving a game I started writing in 2004, called Demon of the Fall.

Demon of the Fall was a way for me to pay tribute to my favorite game, Solstice (and its sequel, Equinox, and related isometric puzzle-platformers like Head over Heels). It started as a straightforward clone, but as Retsyn and I worked together on it, we came up with some uniquely appropriate gameplay elements. I won't say too much about that here, though, because I will probably be giving Demon of the Fall another shot later this year.

All I'll say is that music is central to Demon of the Fall. Because February is also the month of the RPM Challenge, I decided I could get rid of two albatrosses at once by recording the soundtrack as my album for RPM¹, and completing Demon of the Fall.

I started with the best of intentions, as we always do, and resolved to do a little work on it every day. However, the code was in an unusable state in the darcs repo I found, a casualty of a "refactoring" gone wrong.

Over the course of the year, we will examine more of these corpses, and in each case, the cause of death will be the same: refactoring without tests.² I'm sure 2004-me could have given you countless justifications for making these changes without unit tests, but they were all wrong, as 2015-me gets to discover, again and again.

I'll talk about this more throughout the year in these #1GAM posts, but let me just relate this to what I got out of February:

Christer Kaitila talks about the wall as a reason games don't get finished. That's the point where it stops being the drug-like rush of implementing interesting stuff and becomes all about patience, discipline, and other dirty words you spend your early adult years trying to avoid.

I think that a lot of my projects had a cycle like this:

inspiration strikes: I hack out something in a frenzied night or weekend;
there's enough kindling that the fire burns while there are interesting problems to solve and clever algorithms to implement;
but the logs don't catch, and what remains to do is boring (which we dismiss as "too simple" to preserve our ego, though the truth is it's actually "hard but not fun");
time passes, and I wonder whatever happened to project X;
I jump in, but I realize the code is a complete hack, or I've learned a much better way to do some major structural thing, or my knowledge of whatever novel programming language I used has completely changed;
instead of proceeding cautiously (or better yet, just doing the hard-but-not-fun bits), I start cutting huge swathes through the code, breaking everything – "We had to destroy the code in order to save it".

Thankfully, at some point I turned around and saw the trail of dead projects stretching back for miles. Awareness was the first step; I also tried to improve not only my testing and refactoring habits, but also my version control habits (an area where DVCSes have helped a lot). Now I recognize when it's happening, and avoid the sunk cost fallacy that can accompany breaking changes ("I can't revert these commits, they were so much work!").

Anyway, it took a long time to not only undo some of that damage, but also to modernize the code and port it to Windows. Although I worked diligently, an hour or two a day was not enough. My kanban board was like a frozen river.

On February 22nd, I realized that, even if I could ignore all my other work (which I couldn't, since money pays for electricity and guitar strings), there was no way to get Demon of the Fall done by the end of the month.

2. Tricky Kick

After January, though, I had come up with multiple backup plans, in case my primary game for any month didn't pan out. These were mostly plans for clones of simple but fun games that I like. After a bit of deliberation, I decided that I would write a clone of Tricky Kick, a PC-Engine game in the fine tradition of puzzle games about kicking or shoving things, such as Kickle Cubicle and Mendel Palace.

Tricky Kick's puzzles have an excellent property of inducing Einstellung through suggestive placement of the pieces.

One of the reasons I had Tricky Kick in mind was that I had worked on a solver for the levels before, as well as Rush Hour and other Sokoban-like games.³ I knew it had simple mechanics I could implement quickly, and minimal art requirements.

Because of the simple, grid-oriented game state, I considered writing the game with a roguelike interface:

****************
****************
*.*****..*****.*
*.1***....***2.*
*...*....@.*...*
*...3..12..3...*
*...3..12..3...*
*...*......*...*
*.1***....***2.*
*.*****..*****.*
****************
****************

Indeed, the tests still use this ASCII representation:

let kicked_beast_with_no_obstruction_wraps_til_player () =
  compare_boards "
.....1
.1@...
" "
.....1
..@1.."
    (fun it -> move Left it; kick it)

It seemed like something that would be fun in a vector-oriented language like J, but I realized that resolving collisions would be tricky to do idiomatically in J, since they are much easier to deal with sequentially than in parallel.

I had been doing a lot of work in OCaml lately, and since I had prototyped some of my shape grammar stuff with OCaml and the Tsdl bindings for SDL2, I figured I could use a decent language and still get things done. I had delivered software under Windows and OS X with OCaml before, so I figured the porting friction wouldn't be too bad.

3. ZooKicker

In the first ten levels of Tricky Kick, you are kicking adorable animals into each other so that they explode. This is a little bizarre. ZooKicker seemed a suitable name to play on that idea, and I had figured I would draw a bunch of hyper-cute animals to kick around in keeping with that surreal theme.

Of course, none of that polish got done, so the game should really be called ~~SquarePusher~~ RectangleSlipper. What happened?

4. What Went Right

4.1. Testing

I forced myself to write a bunch of tests for all the things that could happen on the board, and this was valuable. It would have been cool to do some fancier property-based testing, but I easily get tangled up in making fancy tests where simple, example-based tests would do. I'm glad I avoided that trap.

4.2. Music archives

Late on the final night, I dug through my archives of unfinished recordings, hoping I would have something I could use as a backing loop to then record some guitar and keyboard over to serve as music. Instead I found way more snippets than I actually needed, and I didn't even bother recording extra parts on top.

Although some of the loops I included are short, I'm happy that every level in the game has its own music, and the music is a big step up from the disaster that happened in January. Maybe by the end of the year I won't be desperately scrambling for music to add at the last minute.

5. What Went Wrong

5.1. Not sending builds to friends

Something that went right in January was uploading new ROM images daily and soliciting feedback from friends, even when the game was trivial and bugs made it unplayable.

Meanwhile, I didn't have Demon of the Fall actually running until the 17th, and I never produced standalone builds of it. With ZooKicker, I made a few attempts at producing Windows binaries and statically linked Linux binaries, but it seemed like too much of a hassle at the time.

Being able to get feedback from people early on is a big motivator, and although I am averse to spending the little time I have each day for this on infrastructure tasks, it seems that it would be worth making that one of the first steps in future #1GAM developments.

5.2. Stealing Levels

Maybe this is something that went right, in a sense; it allowed me to release something. But I'm not happy with it.

I knew, when considering ZooKicker as a backup option, that one of the hardest parts would be coming up with good level designs. I persuaded myself to copy levels from Tricky Kick in order to get things working, thinking that I would have time to spend on level design. I even thought I might have enough time to adapt my solver code into some kind of procedural level generator.

No such luck. I'm sorry about that. But it's only my second greatest disappointment with this month's game; the first is the art.

5.3. Underestimating Art

Trying to learn from January's experience and Demon of the Fall, I applied the McFunkyPants method, but perhaps a little too dogmatically; or maybe I just didn't allocate enough time to the project til the end (it was mostly an hour here and there for the last week of February, until the big push on the last day). Either way, I kept my focus on the no-art playable for longer than was healthy, given that I am a slow and inexperienced artist.

Art takes time proportional to your desired quality level divided by your skill level. I had hoped to make some cute vector creatures to populate the game, but nothing reached a consistent quality level I could justify replacing the programmer art with.

Looking back, I think I could have thought further outside the box and gotten something better together: for example, I could have taken photos of small plastic farm animals (which we have around the apartment, somewhere) and used those as the animal sprites.

In the end, I added a facing indicator to the player's rectangle, which was the final admission that art was not happening this time around.

5.4. Never Trust a New Tool

I have used OCaml for many things, on and off, in the last decade, but I haven't done much game development with it.

I am a huge admirer of Daniel Bünzli's OCaml libraries, but it turned out I had made some rash assumptions about the state of Tsdl. There were no bindings for any of the usual SDL helper libraries. These libraries, such as SDL2_ttf, SDL2_image, and SDL2_mixer, are not necessarily the most full-featured or optimized implementations, but they are incredibly handy for quickly throwing together a game, and I had just assumed I would have them on hand.

So, I had to modify tsdl and create bindings for SDL2_image and SDL2_mixer.⁴ Of course, I ended up doing that, learning the development workflow for opam packages⁵, and learning ctypes, all on the final day, when I really just needed to be creating content and doing polish.

The other thing that bit me is that there seems to be no way to declare that foreign objects have dynamic extent (which I guess is a Lispism), meaning, in this case, that the object should live on the stack.⁶

Not having dynamic extent is a huge pain in the ass, especially when interfacing with C code, which often has an API designed around the idea that small, temporary structures can be cheaply set up without adding any memory pressure.

At first, my trivial no-art playable was GC'ing every few seconds, which is totally unacceptable. It is entirely possible (and desirable), when writing games in garbage-collected languages, to never trigger a GC in the inner game loop. Since I've done this before in other GC'd languages, I had assumed (given OCaml's pragmatism, in general) that this would be no problem.

Memory pressure isn't a problem for ZooKicker right now, but it did give me a scare. Anyway, efficiency isn't something I should be talking about with a game thrown together quickly like this, where List.find accounts for around 7% of the total execution time.

6. A digression: code dumps and maintained software

So, as a result of this project, I have now added two more code dumps to the FOSS landscape, despite resolving to avoid this. I'm going to write more about this soon, but I've realized that I have been a poor free software citizen over the past two decades: I would just leave a tarball somewhere to gather dust (or open a public repo, in the github era), rather than tending my open source code like a garden, and I am changing that.

I still believe code dumps are better than not releasing code at all. Sometimes it's much better to be able to build on someone's unmaintained implementation of something than to start from scratch. That said, there's something unconscientious about it.

There is, of course, an irony about pointing this out in a post about a game like this, which is often the purest form of code dump, since games are rarely maintained.

7. Segue to March

What have we learned?

learning lessons is hard;
content has to come into the picture early;
never do more than one new thing at once.

Is this a game? Yes. Can I release this? Yes. Am I disappointed? Sure, but March is another month, and all I can do is try again.

Footnotes:

¹ I have participated in RPM every year of its existence, and I have never finished what I intended to finish (although I did finish something one year; the less said about that, the better).

² Beyond that, any time you're tempted to call something a "big bang" refactoring, it's not refactoring.

³ I beat all 60 levels of Tricky Kick without using the solver, though. Click here to see the password to unlock all the levels.

⁴ Only enough to get ZooKicker running, for now.

⁵ You probably want path pins, not git pins, no matter what the opam tool tells you.

⁶ Add dynamic extent to the long list of features in CL that many modern languages omit, only to be rediscovered as if novel in the next wave of "system" programming languages.

#1GAM January 2015: Balloon Spite

2015-02-01T03:30:00+0000

Want the game? Here's the ROM. Press B to flap. Hit L or R in the select screen to choose an alternate palette. START skips most screens.

In December, I found out about One Game a Month (abbreviated #1GAM), which is a kind of personal challenge to finish and release one game every month for a year. (I am no stranger to ridiculous personal challenges.)

I read Christer Kaitila's blog post about #1GAM; it deeply resonated with me when I read, "I’ve started so many more games than I’ve finished in the last 20 years."

Even before I knew I wanted to be a programmer, I wanted to make games. I'm sure this is a stated goal as common and clichéd as "astronaut" for the children of the Atari age, but I did pursue it right through my childhood. After typing games in from magazines, I started writing my own, and from somewhere around age six or so onwards until my early twenties, I wrote games. A lot of games.

So where are they? Even the most promising projects, products of fertile collaborations with talented friends, were never finished.

I had thought about doing a personal "games retrospective" before, but I wasn't sure how to do it properly. #1GAM presented an opportunity to dredge up some past games in a time-boxed manner, and to learn more about finishing things.

1. January

January got off to a rough start, though. I had a huge list of games I wanted to work on, and I became paralyzed by choice. After a week of regret and despair, I came back to it more systematically. I couldn't work on a game full-time, so it had to be something I could conceivably release on about a commit or two a day.

I dug through my archives. Unfortunately, the backups I had with me only had a smattering of working copies of CVS checkouts of some old projects, rather than the repositories themselves. One project looked like it met the criteria: Hyper Ballon Struggle.

In 2002, I ordered a Game Boy Advance flash linker from Lik Sang, and RhombusSoft started working on GBA games. (I'll explain RhombusSoft in later #1GAM blog posts.) Hyper Ballon Struggle was a project to see how much GBA game could fit into a weekend's worth of development. (We would often have weekend game making parties, which were basically game jams now that I think about it, although the concept was not known to us at the time.)

It was intended as a Balloon Fight / Joust clone featuring a roster of characters from other games we had developed, in the spirit of crossover/all-star games like Wai Wai World, Saturn Bomberman, or Super Smash Bros.

Once I figured out how to build the copy I had, however, I realized it was incomplete and out-of-date. I knew that the last version we built that weekend had a playfield (but no playfield collision), music, and a "great" rotscale checkerboard effect in the select screen. This version only had the core Joust-style mechanic and little else.

I made some calls to people still living in Newfoundland (where all this part of history happened), and a backup tape was discovered, dated right after we turned out the lights on our little game studio. I was thrilled, until I discovered there was no way to read the tape; a machine with an internal DDS-2 drive was dug out of storage, but the drive no longer functioned. Not wanting to ship the tape around, I decided to work with what I had.

2. What Balloon Spite is

I renamed Hyper Ballon Struggle to Balloon Spite, because I never pass up a cheap pun (Balloonacy was already taken).

2.1. As a game

Because of the possibility that I wouldn't have time to draw new sprites, I decided to turn it into a kind of Street Fighter II-style one-on-one game, but with a balloon popping mechanic.

I added an exertion mechanic (indicated by the sweat drop) hoping it would provide some depth to the play, but the stamina values need some tweaking for it to be really useful. You can sometimes tire out the computer opponent if they try camping at the top of the screen, then get in a quick attack.

There are eight levels. The original intent had been to have a level for each character, plus two or more levels with special boss characters, and of course, different music for each level, with nods in the themes to the music from that character's respective game. Alas, as it is, there are only two short, irritating in-game songs, because the music was done in an afternoon.

The characters and what (unreleased) game they're from:

Character		Game
Harvey		AnimoCity
Rudolph		Quest of Zo
Ralph		Buckler Strife
Lopez		Greed'n'Magic
Pierce		Maelstrom
Greedy		Greed'n'Magic
Sam		Convergence
Iceclown		Fobwart

The iceclown isn't a playable character, but was intended to be a boss. (Lopez was supposed to be a boss, too, but for lack of KidThulhu (from the eponymous game) and Peter (from Demon of the Fall), he remained selectable.)

Myr was one of the artists on the team. There was also a sprite that I thought was supposed to be Retsyn, but he disputes this and removed it from the game. (It should be noted that half these games were written by Retsyn; Buckler Strife is the only one I had no involvement in, though.)

Melville is a reference to the Taito game Cameltry. ("Moby died in the Spinning Room.")

2.2. Technically

According to sloccount, it's about 2620 lines of ARM assembly code (ignoring another 1kLOC of tables and such). It's pretty buggy, but it does have a unique character to it. There were some bugs I consciously decided not to expend time on, like interpenetration resolution having the possibility of pushing a character into the playfield. In the final hour or two, copy-and-paste became the dominant coding paradigm. The code is on github.

GrafX2 was used for all pixels, both in 2002 and in 2015. Levels were created as PCXes (yes, PCX, even in 2015) and converted to tiles with a tool borrowed from Convergence. Music, such as it is, was created in emacs and compiled with mumble.

The original game did run on the real hardware, since we didn't have a usable emulator at the time! All my GBA development hardware is in storage, so no new builds have been tested on the real thing. That's okay; I will be testing it in the future, and maybe I'll patch it later. Most people will be playing on emulators, anyway.

3. What Went Right

3.1. Targeting the GBA

By producing a ROM image that could be run in any emulator, I made it much easier to send builds to my friends and get quick feedback. This was an unexpected benefit. Getting feedback, even on early, broken builds, was helpful in maintaining motivation.

I remember that iterating with the flash linker was a really slow process, so I have no shame in targeting an emulator where I can get nearly immediate feedback from a build.

3.2. What went right: Retsyn's great pixels

I was already blessed with some pretty great sprites:

Or so I thought, until Retsyn came to the rescue at the last minute and revamped the sprites and palettes:

He also did the background for Myr's moon stage. (And, of course, the vast majority of the old Rhombus content was done by him, back in 2002 and before.)

It was inspirational, and kept me going in the final moments of the project. Of course, it also made me regret the time wasted that could have gone into polishing other aspects, like the music and sound.

3.3. What went right: Resolving to Ship

The Perfectionist's Handbook is one of the books that has been a critical part of my journey towards becoming a finisher. I realized I had to expose myself to criticism if I was going to get anything done. I had to stop seeing every piece of code as a reflection of my self-worth. And I had to stop trying to optimize (or elegantize) the hell out of everything.

A number of books draw the distinction between healthy and unhealthy perfectionism, but Szymanski's Handbook was the most useful in convincing me that I could let go of some things without losing the good things about perfectionism. Most importantly, it helped me understand that perfectionism, instead of causing me to release only flawless code, had caused me to withhold tons of good code from release, releasing only mediocre code under duress.

I stole a few tools from my other projects, most importantly tools for converting PCX files to GBA tiles. The code was often terrible – several of the tools were among the first programs I had ever written in OCaml, and it certainly shows. My instinct was to dramatically refactor them immediately, and I am a little proud that I resisted. Maybe I will go back and clean them up, but I recognized that it wouldn't further my goal. The Julian of 2002 would not have been able to bear that.

.section .ewram
.align 2
@@ XXX should use an overlay for this
.lcomm balloons, BALLOON_LEN*MAX_BALLOONS

There were so many opportunities for optimization that I avoided making. For example, separate screens ("activities" in Android lingo) could share the same region of memory for their local variables, using overlays in the linker script. Didn't do it. All the tile data, map data, music data, and even PCM samples (!) are totally uncompressed. Scandalous! Various structures in memory wasted bits or even bytes out of convenience. In 2002, I could never have endured that.

One thing I didn't know at the time was that the code in ROM would have been faster, in general, if it had been written in Thumb mode rather than ARM mode. I strongly considered rewriting my code to use mixed modes, but I reminded myself: I need to ship this in a matter of days. It's just a silly little game. YAGNI.

When it came time to implement normalization of vectors when computing contact normals for collision resolution, I did waste a bit of time thinking about the efficient implementation of reciprocal square root versus atan2. I had also been thinking about implementing the Sunderland algorithm for improving my trigonometric function lookup tables. YAGNI. Maybe later.

compute_contact_normal:
        stmfd sp!, {lr}
        @@ XXX Ideally, we'd use the reciprocal square root here.
        @@ There are great, simple algorithms for it.  But let's get the
        @@ slow way working first.
        mov r0, r7, lsl #PHYS_FIXED_POINT*2
        swi #8<<16              @ sqrt
        mov r7, r0
        mov r0, r8, lsl #PHYS_FIXED_POINT*2
        swi #8<<16              @ sqrt
        mov r8, r0
        @@ r7 = distance (12.4), r8 = penetration (12.4)

I used the slow BIOS division and square-root routines instead; the first time, my hands trembled. I think it's the first time I've written a division on a system like the GBA that wasn't a shift or a reciprocal table lookup. The second time, I stopped and wondered how many divisions I could survive in one frame. By the eighth time, I didn't even think about it. Ship it. Optimize only if there's a reason to do so.
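For what it's worth, the "slow way" is easy to model outside of assembly. Here is a minimal Python sketch of normalizing a contact vector with one integer square root and two divisions. It is not the game's actual code: it assumes a fixed-point format with four fractional bits (as the 12.4 comments above suggest; the real value of PHYS_FIXED_POINT isn't shown), and the function names are made up for illustration.

```python
from math import isqrt

FRAC_BITS = 4            # assumed: four fractional bits, as in a 12.4 format
ONE = 1 << FRAC_BITS     # fixed-point representation of 1.0

def fixed_length(dx, dy):
    """Length of the fixed-point vector (dx, dy), in fixed point.

    The products carry 2*FRAC_BITS fractional bits, so the integer
    square root of their sum carries FRAC_BITS fractional bits again.
    """
    return isqrt(dx * dx + dy * dy)

def fixed_div(a, b):
    """Fixed-point a / b, truncating toward zero."""
    q = (abs(a) << FRAC_BITS) // abs(b)
    return q if (a < 0) == (b < 0) else -q

def contact_normal(dx, dy):
    """Unit vector along (dx, dy), or (0, 0) for a degenerate vector."""
    length = fixed_length(dx, dy)
    if length == 0:
        return (0, 0)
    return (fixed_div(dx, length), fixed_div(dy, length))

# (3.0, -4.0) has length 5.0, so the exact normal is (0.6, -0.8);
# the result here is truncated to four fractional bits.
nx, ny = contact_normal(3 * ONE, -4 * ONE)
print(nx / ONE, ny / ONE)
```

A reciprocal square root would replace the two divisions with multiplications, which is the optimization the XXX comment is pining for.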

3.4. What went right: I started to enjoy working on it

The most important thing that went right is that, to my surprise, I started to enjoy working on the game; I even started to enjoy playing games again.

It's been years since I've enjoyed playing videogames at all. I think the peak for me was when I was in my early 20s. (If I don't enjoy games, why am I writing them? There's a question to answer later this year. I'm still thinking about it.)

The possibilities of the idea, which seemed stunted as I began, opened up as I spent time with it. I didn't get to exploit any of those possibilities, really, but it was a good lesson.

4. What Went Wrong

Getting back up to speed with the GBA took a little while, and given the late start, it ate up days that would have really counted in the end.

I have to admit that when I started, I was so rusty that I made mistakes like writing mvn r0, r0 rather than rsb r0, r0, #0 when trying to negate an integer, but it came back to me eventually.
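For anyone who hasn't made that particular mistake: mvn computes the bitwise complement, which in two's complement is off by one from the negation that rsb r0, r0, #0 computes. A minimal Python sketch, modelling a 32-bit register with a mask (the helper names are hypothetical, chosen to mirror the instructions):

```python
MASK32 = 0xFFFFFFFF  # model a 32-bit ARM register

def mvn(x):
    """Bitwise NOT, as `mvn r0, r0` computes."""
    return ~x & MASK32

def rsb_zero(x):
    """Reverse subtract from zero, as `rsb r0, r0, #0` computes: 0 - x."""
    return (0 - x) & MASK32

def to_signed(x):
    """Interpret a 32-bit register value as a signed integer."""
    return x - (1 << 32) if x & 0x80000000 else x

# Each row: the input, what mvn gives, and what rsb gives.
for value in (0, 1, 5, 1000):
    print(value, to_signed(mvn(value)), to_signed(rsb_zero(value)))
```

The mvn column is always one less than the rsb column, which is exactly the kind of bug that looks almost right on screen.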

I didn't have the same cross-compiler the original project was built with, so I dropped in the linker script I wrote for Convergence and hoped it would work. This resulted in a couple of days of debugging before I realized that the .data and .bss sections were silently being put in ROM. It turned out that, in Convergence, I had always indicated explicitly whether space was to be reserved in EWRAM or IWRAM, so I never added a generic BSS or data segment to the linker script. Once I figured it out, I was pretty surprised that ld hadn't complained, but these are the perils of reuse from other projects.

4.1. What went wrong: Making assumptions about tools available

I left the music til pretty late, because I can compose pretty quickly (as evidenced by Chip Weekend); the original music routine written for Hyper Ballon Struggle was missing, but I assumed that the music playroutine in Convergence would be fine, and that my archive of that would include tools to convert some format (probably XM or MIDI) to its internal format.

Imagine my horror when, on January 29th, I attempted to replace the awful test song only to discover that the music conversion tools for Convergence, if they were ever finished, weren't in the copy I had.

I decided I would add support for PH-1 (the Convergence playroutine) to mumble, a compiler for an MML-like textual music description to various playroutines that I wrote when I was working on Atari ST demos and then totally abandoned, circa 2004. It was pretty naïvely implemented, and very incomplete. There was another temptation to rewrite it, especially given how much my understanding of Common Lisp has improved since I wrote it, but at this point I knew there was no time.

;; Victory fanfare
A o3     a12aaa4     >c12ccc4 | c+12c+c+c+4 e12eee4  | e1
B o2 %i2 c+12c+c+c+4 e12eee4  | e12eee4     g12ggg4  | b1
C o2     e12eee4     g12ggg4  | a12aaa4     >c12ccc4 | g+1

I was planning to present Balloon Spite to some friends that evening, so I tried to focus on the straightest path to the goal, even if it meant a lot of compromise. I hacked in crude PH-1 support, although mumble's lack of support for drum kit-style bindings for the noise and PCM channels meant I couldn't quickly write percussion parts. This is one of the worst deficits of the music that ended up in the game, and it's one of the reasons it all feels terribly primitive.

Added to that, the playroutine design for Convergence sucks; there's no getting away from that. Music was left til late in Convergence, too, so the playroutine design was never battle tested. It was my second (or maybe third) chip playroutine design and suffered from me trying to do things differently, thinking that a more musical representation would be compact and efficient. (These days, I think the approach KB used for fr-08 is way better than something like this.) Also, many features were simply unimplemented: no arpeggios, no vibrato, no volume envelopes – those are the essential tools for making chip music that doesn't sound dead, that doesn't sound like the output of BASIC's PLAY statement.

4.2. What went wrong: Not following the McFunkypants Method

I was aware of the McFunkypants Method for finishing a game, but I didn't follow it as closely as I could have.

I think the biggest mistake was not focusing relentlessly on the core gameplay until it was done. I was seduced by my overall vision, especially aspects that were more "tech demo" than game. The "versus" screen was supposed to have a rotscale spinning-out effect on the "VERSUS" text, for example. Levels were supposed to have different combinations of parallax scrolling foregrounds and backgrounds, et cetera. The select screen has its nauseating rotscale checkerboard effect (a recreation of an effect I remember from the original). Players can have alternate palettes (and palettes are pretty customizable in general). A lot of time was wasted doing these things or experimenting with them.

That time would have been better focused on the core gameplay. As late as last night, my final night working on the project, I was making major gameplay changes, and the build fluctuated between impossibly hard and incredibly easy. The final build is not as much fun as one of the earlier builds, but I accept that as a lesson about priorities and time management.

For February, I'm planning to follow the aforementioned method much more closely. Today I'll be making my storyboard, since the game idea is already established, and planning what is required to completely implement the core gameplay in the first week.

5. Lessons Learned

I think the summary of all that is:

verify your tools before you start
make choices early and stick with them
the Muse waits for you at your desk

Cumulatively, this was a full weekend of effort in 2002, plus about thirteen days of spare-time hacking and maybe two days of full-on work. Given that timeline, I am pretty happy with the result. Maybe next year I will revisit it and put out a "remix" version with the music and physics it deserves, but even if I don't, I'm content that I gave it a chance to finally see the light of day.

Onward, to February!

Papers We Love Montreal Followup: Procedural Modeling of Buildings

2014-11-29T03:30:00+0000

(I organize a meetup group which is the Montreal chapter of Papers We Love; we meet monthly to discuss a paper or papers someone loves, to help bridge the gap between industry and academia for working programmers. This is a followup to the last meetup, where I talked about the paper Procedural Modeling of Buildings.)

1. On questions

There were some very good questions during my talk, but I think it could be made more clear: at a PWL meetup, there are no dumb questions, and "What's a Spline?" questions can often benefit everyone. For example, I was a little surprised no one asked about Euler operators, since I'd never heard of them until I started implementing the boundary representation code for this talk. There were a few aspects like that I realized I should have explained in more detail. Next time, you can help, by speaking up and asking a question.

Also, I'll add this to the meetup page, but it should be noted that papers, presentations, and questions are welcome in French. If the speaker doesn't understand French, someone will be happy to translate.

2. Grammar-driven test case generation

One of the big things we talked about was how to choose productions when generating from a grammar, and I mentioned Monte Carlo tree search in the context of a game where your generator is "playing against" the system under test, with crashes, resource leaks, or code coverage as the "score". Lots of learning algorithms could be used in this case, but MCTS is worth checking out because it doesn't require coming up with a good admissible heuristic yourself. Also, it's a very cool, very general algorithm that has had tremendous success playing Go against humans.
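As a reference point for the production-choice problem, here is a minimal sketch of the naive baseline: a generator that picks productions uniformly at random. The toy grammar, the depth limit, and all the names are assumptions for illustration, and this is not the ABNF-driven generator from the talk. A search method like MCTS would replace the rng.choice call with a choice guided by the score.

```python
import random

# Toy grammar: nonterminals map to lists of productions; any symbol
# that is not a key is a terminal.
GRAMMAR = {
    "expr": [["term"], ["term", "+", "expr"], ["term", "*", "expr"]],
    "term": [["digit"], ["(", "expr", ")"]],
    "digit": [["0"], ["1"], ["2"], ["7"]],
}

def generate(symbol, rng, depth=0, max_depth=6):
    """Expand `symbol` into a list of terminals, choosing productions at random."""
    if symbol not in GRAMMAR:
        return [symbol]
    productions = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, always take the shortest production so
        # that the derivation is guaranteed to terminate.
        production = min(productions, key=len)
    else:
        production = rng.choice(productions)
    out = []
    for sym in production:
        out.extend(generate(sym, rng, depth + 1, max_depth))
    return out

rng = random.Random(2014)  # seeded so that runs are repeatable
for _ in range(3):
    print("".join(generate("expr", rng)))
```

Uniform choice tends to produce lots of shallow, boring sentences, which is precisely why a smarter way of choosing productions is interesting.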

The ABNF-driven generator I mentioned will be released soon. Until then, check out this one.

The paper I mentioned regarding this approach for protocol fuzzing was Extraction of ABNF Rules from RFCs to Enable Automated Test Data Generation.

3. L-systems

The Algorithmic Botany page is full of great stuff, often L-system-related.

I mentioned Sean Barrett's criticism of L-systems, and I think it's worth a read. I don't agree with his conclusion, but his reduction of a specific L-system down to a simple algorithm is a great demonstration of the need to think about generality and constraint in notation.

4. Procedural Modeling of Buildings

The most important link is Peter Wonka's publications page; most of the papers mentioned are there, or are directly cited by one of those papers.

In particular, in addition to the central papers discussed (Instant Architecture, Procedural Modeling of Buildings), check out the papers on Inverse Procedural Modeling, and parallel derivation/evaluation of shape grammars.

Also, I mentioned CityEngine, but be sure to check out their impressive demo videos.

Two things I would have liked to discuss more were the parallels between a grammar-driven generator and a programming language interpreter, and graph rewriting / graph grammars. Maybe those are topics that can come up again at a future PWL meetup.

4.1. Bag of links

Matrix grammar — a potentially simpler way to formalize the LOD idea in the paper;
Shape Grammars — in which you can find reference to many uses of shape grammars in architecture and design, including the coffee maker and Buick applications I mentioned;
Virtual Terrain Project;
Generative Modeling Language;
GRAPE — a parametric shape grammar interpreter;
From topologies to shapes: parametric shape grammars implemented by graphs;
The aesthetics of science fiction spaceship design — be sure to check out the Death Star trench modeling section;
Andrew Li's publications — including some interesting analysis of historical Chinese building standards with shape grammars;
Garment-modeling papers that look interesting but which I haven't read:

5. Demos

The 64k demo we watched at the beginning was The Timeless (1st place, pc 64k, Revision 2014), by Mercury. As I mentioned, given what looks like shape grammar techniques, I suspect the title is a reference to The Timeless Way of Building by Christopher Alexander.

I think one of the big inspirations for procedural content generation at this scale was .the .product (1st place, pc 64k, The Party 2000), another 64k demo by one of my favorite demo groups, farbrausch. The architecture generated is less sophisticated than in the Mercury demo, but given that it was released fourteen years ago, I think it could be considered more impressive.

I meant to talk more about the tools they released, but you can find the source code in their github repo. Among other things, this includes their 96 kilobyte FPS game. There are a couple of talks on YouTube by members of farbrausch talking about some of their techniques that are worth checking out, especially if you're interested in texture generation.

6. The next paper

We decided at the meeting to skip December; I'll be out of town, and late December is a busy time for everyone. This gives us a bit of time to plan the January meetup, so I urge anyone in the group with an interest in presenting a paper to contact me about presenting in January.

Alternatively, if there's a topic you'd like to see discussed but don't feel comfortable presenting, why not ask?

Here are some papers and topics I'd love to see presented:

Differential Privacy
Monte Carlo Tree Search
linear types and all kinds of resource-bound guarantees (e.g. Robson's proof that SQLite uses)
Flipping Bits in Memory Without Accessing Them (chilling!)
Coverage is not strongly correlated with test suite effectiveness (provocative!)
A Language-based Approach to Unifying Events and Threads

There's also a poll, here on the meetup website, but I can appreciate that no one has voted on it yet, since meetup's website forces me to make it illegibly dense. Also, if you need help tracking down a copy of a paper, just email me.

ILC2014 summary

2014-08-19T02:30:00+0000

When I heard that the International Lisp Conference (ILC) was happening in Montreal this year, I got very excited. Actually, I started making ambitious plans for a paper to submit and a talk to give, against which the rest of my life conspired, but I certainly registered as soon as registration opened. I've always wanted to go to ILC (and ECLM and so on), but having it happen in the city in which I live meant I had to go.

I've noticed some people on #lisp asking about the talks, so I'll try to give a brief summary of each talk, to the extent that I remember. If you have any questions, drop me a line via email, or on #lisp on Freenode IRC (my nick is tokenrove; attending ILC encouraged me to idle more).

Generally, the talks were much less technical than I had anticipated, and aimed at a very broad audience. Almost every talk seemed to run late, so I refrained from asking questions, interrogating the speakers in person during the breaks instead.

There were about 70 people, overwhelmingly male and white. There were perhaps two or three women. This was a bit of a shock for me, since other technical events I've been around lately have had much less homogeneous demographics.

In spite of my grumblings throughout this post, I have to say that the conference was overall very well organized, and my thanks go out to all the organizers.

1. Friday
2. Saturday
3. Sunday
4. Summary: Rekindling the flame

1. Friday

Luckily I was already familiar with the Université de Montréal campus, otherwise I surely would not have found the venue in time for the first talk. Later, there were posters up in a few places, which helped a bit, but initially there was no indication where this was happening. I heard from at least one other person who did miss the first talk on account of this.

Oh, and of course UdeM is built on quite a steep hill, so I imagine most people arrived a bit sweaty. It was fortunate we weren't having the kind of hot, sticky weather that often happens here this time of year.

1.1. Tutorial 1: Multiplatform and Mobile App Development in Scheme with Gambit/SchemeSpheres

Speaker: Álvaro Castro-Castilla.

The first talk was about an application framework called SchemeSpheres, which provides, among other things, modules, conditional compilation, and a unified build system.

It was nice seeing a focus on application delivery (an area where Scheme traditionally trounces CL). The demonstration was not terribly convincing, but the software itself seemed promising.

The speaker spent considerable time discussing the provision of features which CL already provides. This was the first instance of a recurring theme at the conference, and one that, to me, could have been discussed in the final panel: how do we increase awareness of the powerful facilities provided by CL and its ecosystem?

1.2. Tutorial 2: A Gentle Introduction to Gendl, a Common Lisp-based Knowledge Based Engineering Environment

Speaker: Dave Cooper.

This was basically a walkthrough of some examples and exercises with http://gendl.com/.

It was during this tutorial that the audience was polled for CL users, and the vast majority of the attendees raised their hands.

I wanted to ask some questions about units of measure (a pet bugaboo of mine at work lately) but I think we were already running late.

1.3. Cocktail

The social aspects of the conference were actually the most valuable for me. Hearing about what people were working on, getting to put faces to names, and meeting new people all made the conference worth it.

Oh, and the catering was great. That always helps.

2. Saturday

2.1. What a SOOC!

Speaker: Christian Queinnec.

The legendary Christian Queinnec discussed a non-massive MOOC that he ran (https://programmation-recursive-1.appspot.com/course), including copious statistics about participation, and the mechanics of running the course.

For me, the most interesting aspect was his discussion of automated grading. All students were required to test their code, and submit their test suite along with their implementation. The grader then checked not only whether the student's implementation satisfied its test suite, but also other combinations, as follows: (v() being the tests, and f being the implementation, s and t being student and teacher respectively)

coherence	check v_s(f_s) and v_s(f_t)
correctness	check v_t(f_s)
coverage	compare v_s(f_s) and v_t(f_s)
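One way to read that table, sketched in Python. Everything here is an assumption made for illustration, not the actual grader from the course: a test suite is modelled as a list of (arguments, expected result) pairs, and the coverage check is only one plausible reading of "compare".

```python
def passes(tests, impl):
    """v(f): True if implementation `impl` satisfies every (args, expected) pair."""
    return all(impl(*args) == expected for args, expected in tests)

def grade(v_s, f_s, v_t, f_t):
    """Cross-check student (s) and teacher (t) tests (v) and implementations (f)."""
    coherence = passes(v_s, f_s) and passes(v_s, f_t)
    correctness = passes(v_t, f_s)
    # Assumed reading of "compare": both suites should reach the same
    # verdict on the student's implementation.
    coverage = passes(v_s, f_s) == passes(v_t, f_s)
    return {"coherence": coherence, "correctness": correctness, "coverage": coverage}

# Hypothetical exercise: factorial. The student's version is wrong for 0,
# and the student's tests never try 0.
def teacher_fact(n):
    return 1 if n == 0 else n * teacher_fact(n - 1)

def student_fact(n):
    return n if n <= 1 else n * student_fact(n - 1)

teacher_tests = [((0,), 1), ((1,), 1), ((5,), 120)]
student_tests = [((1,), 1), ((4,), 24)]

print(grade(student_tests, student_fact, teacher_tests, teacher_fact))
```

In this made-up case the student's work is coherent (their tests agree with both implementations) but neither correct nor well covered: the teacher's tests find the bug at 0, and the student's tests don't.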

Another interesting concept he mentioned was "epsilon-peeping"; students with incorrect solutions could be shown slightly less incorrect solutions from other students, to guide them towards correctness, and give them experience reading code.

I'm not sure if the talk is available in English anywhere, but the slides for a version in French appear to be available here.

2.2. Kilns: A Lisp Without Lambda

Speaker: Greg Pfeil.

The first talk where I had to think. Greg Pfeil presented a language he's been working on called Kilns, based on the kell calculus. There were two interesting aspects of this talk; the first was the idea of modelling locality in the language, and where that could go. I wondered whether locality could be extended "inward" instead of out, as in a memory-hierarchy-conscious language like Sequoia.

The second aspect was a bigger goal he described, involving a combination of educational reform and layers of languages to empower everyday programming. This reminded me a bit of some of the ideas coming out of VPRI.

2.3. Using Common Lisp as a Scripting Language

Speaker: François-René Rideau.

Fare talked about the very practical aspects he has driven forward with his work on cl-launch and newer versions of asdf. The situation for CL scripting sounds much nicer than it was even a few years ago.

There was a question about startup time from Marc Feeley, and Fare indicated that scanning for ASDF systems was the bulk of the startup time, and that it was easily eliminated by compiling the script, although he felt it was mostly negligible anyway.

You can find more of the details here: https://github.com/fare/asdf3-2013

2.4. Common Lisp's Predilection for Mathematical Programming

Speaker: Robert Smith.

An interesting talk espousing the virtues of CL as a language for numerical computation and experimental mathematics. Robert Smith showed off some code, which I understand can be obtained from his bitbucket. He demonstrated how CL provided very effective tools for succinctly expressing the constructs he needed to perform this work. It was nice to see someone fire up Emacs and show off a bunch of macros.

I asked him where he felt CL implementations should go in the future to better serve mathematicians, and he pointed me to his blog entry, Things I Want in Common Lisp. It brings up something which was another theme: everyone seems to want CL's standard library to be more generic.

2.5. Typed Clojure

Speaker: Ambrose Bonnaire-Sergeant.

I understand that the organizers put a lot of effort into trying to attract members of the Clojure community to ILC this year, but they don't seem to have been successful. Bonnaire-Sergeant was effectively the only (vocal) Clojure user I met at the conference.

He gave a talk on his work on Typed Clojure, which is directly inspired by the work of Typed Racket. It was a nice overview, but again I found myself saying to myself, "don't people know that SBCL already does a pretty good job at this? Don't people know that CL has had optional typing forever, and that declarations are great?" (Not to mention that the potential utility of declarations extends beyond just type annotations.)

It was originally supposed to be a discussion specifically of the interlanguage interoperability aspects of Typed Clojure, but Bonnaire-Sergeant decided to just describe the Typed Clojure system, given the number of attendees unfamiliar with the work.

2.6. Hygienic Macro System for JavaScript and Its Light-weight Implementation Framework

Speaker: Ken Wakita.

I wasn't too interested in this talk from the abstract, but it ended up being a great presentation. Ken Wakita presented ExJS, which I gathered was a simple, elegant implementation of macros for JavaScript that actually had a very palatable syntax. Wakita was very clear about the existing work, how their work differed, and the kinds of problems they solved. One interesting thing about it is that they convert the original JavaScript to s-expressions and use an existing Scheme implementation for macro-expansion.

I didn't get a URL for this, and a few seconds of trivial googling didn't help much, but https://github.com/homizu/js-macro might be the github project for ExJS.

2.7. An Array and List Processing System

Speaker: Dave Penkler.

This was the talk I most anticipated, because it's very closely related to work I've been doing that I had hoped to present.

Penkler presented ALPS, a fascinating rapid prototyping environment derived from Lisp and APL, including support for interactive graphics, audio, task scheduler, and probably countless other things.

This seemed to be the epitome of the personal programming environment, as unsuited to the development of stifling "enterprise" software as it is suited to maximally amplifying the output of a single programmer who knows the system intimately, like a well-worn favorite musical instrument.

He went into detail as to why he felt LISP 1.5 and APL/360 were excellent models for this kind of system, as opposed to their more modern descendants.

He demoed the obligatory Conway's Game of Life. One curiosity that was revealed out of this is that ALPS supports both APL's concept of booleans (1 and 0) as well as Lisp's (t and nil), which seems a little confusing.

ALPS does not have a type system, per se, and the set of types was intentionally kept quite simple. The sole numeric datatype is IEEE 754 double floats, not unlike JavaScript. The read syntax for arrays simply uses square brackets to indicate a vector, which can then be shaped multidimensionally with p (an ersatz ρ) if required.

GC is a simple mark-and-sweep approach, but with separate spaces for conses and vectors. I asked if he was doing any reference counting tricks (since you can often overwrite arrays in-place if you know there are no other references) but he indicated that always copying was fast enough.

ALPS has been ported across many machines and architectures; he even has it running on his phone! Penkler indicated that, at this point, having been so widely ported, ALPS' own support library is so all-encompassing that it could be run on the bare metal without any real OS support. It was originally implemented in Pascal, and later ported to C.

He indicated that he wasn't familiar with APROL, which is the main attempt of which I'm aware to blend Lisp and APL. Of course, there are other shades of that: SERIES is inspired by APL, and K is heavily inspired by Scheme (but stays, culturally, on the APL side of the fence). There are also a number of more modern attempts, particularly inspired by J and K, such as Redick, but that's a topic for a later blog post.

As far as I know, there is no public release of this system, but I exchanged contact information with Dave and I will update this page if there are any more details.

2.8. Reaching Python from Racket

Speaker: Pedro Ramos.

A technical discussion of the difficulties of reaching Python libraries which use C modules, and an approach the authors used to solving the problem for Racket. It seemed that the talk covered the same ground as the paper, although it was nice seeing the demo step-by-step.

2.9. Lightning talks 1

My memory of the lightning talks is very fuzzy, so I apologize if I have omitted your talk.

There were a couple of interesting implementation lightning talks. There was a talk on reducing the overhead of structures in Gambit Scheme, and one on lazy compilation and code versioning.

There were recruitment spiels from ESS Technology and RavenPack, which I'm sure will also be posted on Lispjobs if they haven't already been.

Paul Tarvydas presented some interesting work on writing reader macros that use PEG parsers to support fancier syntax, and some experiments in flow-based programming. This work is available at https://github.com/guitarvydas.

Finally, Didier Verna gave a very entertaining introduction to his :o( Smilisp :-) dialect to finish off the day.

3. Sunday

3.1. Emacs Lisp on the Move

Speaker: Stefan Monnier.

This was the only truly universal talk at the conference. When asked who didn't use Emacs in the crowd, there was only one hand raised (and they were promptly lynched).

Monnier talked a bit about where elisp came from, some rationale for design decisions which seem painful now, and went into some depth on the efforts to improve the language. Particular advances he described in detail included lexical binding and the new advice system. He indicated that there were many things which constrained the rate of change, although that rate has been increasing in the last few years.

There was some especially interesting discussion of language features that had traditionally been slow, and how that affected the idioms and usage. Finally, there was speculation on the future, including the current progress of running elisp in Guile's VM.

3.2. A Scheme-based Closed-Loop Anaesthesia System

Speaker: Christian Petersen.

Here was a real Lisp-in-the-trenches success story. Christian Petersen described a sophisticated medical application built with Scheme, including application delivery across embedded, desktop and mobile platforms.

It was especially interesting to hear about their approach to safety certification, and his emphasis that formal verification could never entirely replace testing, since the application has to be delivered on top of millions of lines of unverified code in the OS, anyway.

People don't typically reach for a language like Scheme when building safety-critical software like this, so this is a story it would be nice to see spread to a wider audience.

The framework or development system they built in the process of building this and other systems is called LambdaNative, and is open source.

3.3. Leadership Trait Analysis and Threat Assessment with Profiler Plus

Speakers: Michael Young and Nick Levine.

The original application mentioned in the title here, and in the proceedings, was not presented, as the speakers felt a related application, thoughthelper.com, would demo better.

About half the talk was actually an introduction to the concepts of cognitive behavioral therapy, along with a demo of some of those concepts in action in the aforementioned web application. Then we got to see a few of the details of the CLOS-based NLP framework underlying it. I always hate Lisp workflows that are as "clicky" as so much LispWorks usage seems to be, but it presented nicely.

3.4. Efficient Finite Permutation Groups and Homomesy Computation in Common Lisp

Speaker: Robert Smith.

Here was another nice, REPL-driven demo of some interesting code, attached to an interesting mathematical result. I was afraid this was going to turn into an hour-long attempt to explain basic group theory, or alternately an impenetrable presentation of hyperspecialized results, but it trod a nice middle ground, and aside from some arguments about the phrase "bit inversions", the audience seemed appeased.

The permutation group code appears to be available at https://bitbucket.org/tarballs_are_good/cl-permutation.

3.5. CL-FFF: A Common Lisp Full Stack Framework for Web Apps

Speaker: Marc Battyani.

There were a lot of framework presentations at this ILC. I've used a lot of Marc Battyani's software in the past so I was eager to see this presentation. Unfortunately, I didn't find it very engaging (though I'm sure cl-fff is a fine framework for web applications); most interesting was Battyani's description of some of the commercial applications he's built in CL.

Here's a link to CL-FFF: https://github.com/mbattyani/cl-fff

3.6. SICL spinoffs: Generic Dispatch, Garbage Collection, and CLOS Bootstrapping

Speaker: Robert Strandh.

This was, hands-down, my favorite talk at ILC. I guess I was expecting more talks of this nature, but this is what I came for. Strandh presented all three of the papers from the proceedings in a condensed form. (Two of those papers seem to be here: http://metamodular.com/generic-dispatch.pdf, http://metamodular.com/sliding-gc.pdf)

The generic dispatch optimization directly relates to something I'm going to be blogging about soon; I actually got into a disagreement in a job interview over this basic idea – that table-based dispatch, in general, is now often outperformed by switch-like code (carefully ordered integer comparison and branches), on modern hardware.

One question that was raised during this portion was how this approach compared with inline caches. I wanted to write something about that here but I'd better do some experiments first.

The GC approach was very cool, although I heard some grumbling from a few people about having done it before. Hendrik Boom mentioned that he had implemented a similar sliding GC in the '80s. R. Matthew Emerson mentioned that CCL has a nice mark&compact GC that no one has gotten around to writing a paper about.

The CLOS bootstrapping portion discussed a technique called satiation to overcome metastability issues.

Since I've been living under a rock, I wasn't really aware of SICL, its goals, or its progress, but I find it terribly exciting, especially SICL's IR, Cleavir.

SICL, including the slides of Strandh's talk, can be found at https://github.com/robert-strandh/SICL.

3.7. A Transformation Based Approach to Semantics-Directed Code Generation

Speaker: Arthur Nunes-Harwitt.

Starting from the principle of a closure as a kind of primitive code generator, Nunes-Harwitt showed a series of relatively straightforward transformations to create a compiler out of an interpreter. He then compared the performance of the result with Norvig's well-known Prolog from PAIP. He stressed that this was a manual technique, noting that in addition to the base transformations, he had also swapped out unify with a fast union-find implementation.

This was another condensation of multiple papers, and it was relatively difficult subject matter after a long day, which may have contributed to the paucity of questions asked.

3.8. Lightning talks 2

I guess one of the more unexpected presentations was that of (if I recall the name on the slide correctly) Esposito Louis. This was a presentation by a 14-year-old of a Lisp dialect he built, including an interactive graphical frontend. It was difficult to understand exactly what was being claimed, though, since the presentation basically was a very rapid scroll through an interactive notebook that seemed to be based on Maxima, punctuated with terse exclamations of the form, "here I demonstrate mappings in the complex plane …".

Greg Pfeil gave a brief talk on Quid Pro Quo, which I knew in its former life as dbc.lisp floating around on the net. This was a great and inspiring example of the kind of curation that needs to happen in the CL community if we want to really solve the library problem.

There was a brief announcement from Dave Cooper about the Common Lisp Foundation, which promised more CL-specific meetings and other good stuff. Probably the biggest relief was hearing that the domain names, like cliki.net, which have been bouncing around and renewed at the personal expense of various individuals, would finally be taken care of in a more responsible way. He also gave us a preview of a relaunched common-lisp.net which will probably be live by the time this blog post is disseminated.

3.9. Panel: "The Next Move for Lisp"

To conclude the conference, there was a panel about the future of Lisp, comprising Fare, Christian Queinnec, Nick Levine, Dave Cooper, and Greg Pfeil. I found this discussion frustrating, and I was pretty tired, so I apologize for the misrepresentations I am bound to make in this section.

The first topic for the panel was the Lisp community. Queinnec indicated that he didn't think there was such a thing as "the Lisp community" divorced from a specific language or implementation, which seems about right. Seeing as he was the only Schemer, the ensuing discussion too often conflated Lisp-in-general and Common Lisp.

There was only one question in the following discussion that was actually important: "where are the Clojure people?". I don't think we got a satisfactory answer to that question.

I asked why the demographics of this conference were so skewed. I was tired by this point of the day, so my question was probably pretty incoherent, unfortunately. I was however disappointed that it was completely ignored, and never addressed.

Trying to stir up trouble, I also mentioned the Smug Lisp Weenie image (which, real or imagined, is the biggest obstacle in the Lisp community, in my opinion), but no one bit. To me, one of the reasons Clojure won big was by not being called Lisp, which allowed it to escape a lot of the baggage associated with Lisp, especially the Smug Lisp Weenie aspect.

There was some discussion of the discoverability of Lisp; evidently the lack of a canonical forum has been a difficulty for some. There was a shoutout to #lisp, and an anti-shoutout to comp.lang.lisp.

The next two sections of discussion were on Lisp in innovation, and practical directions for Lisp, if I recall correctly. In any case, the topics became blurrier as the crowd started to interject with greater frequency. The core of the innovation part was basically, "why are all the PL researchers using Haskell/ML-family languages instead of Lisp?". The last section didn't have much coherency at all, except for treading over the usual watchwords: "libraries", "documentation", "curation", "community".

Robert Strandh pointed out something very important, which was that CL and Scheme are based on standards, in an age of languages defined by de-facto canonical implementations, and that this could be a source of strength if we paid attention to that advantage.

Near the end, R. Matthew Emerson said what hopefully everyone was thinking, which was that panels like this, filled with hand-wringing, tend to be pretty depressing; that important solutions like Quicklisp come out of people deciding to solve their own problems in Common Lisp; and that the most important thing any of us can do is to just go hack more Lisp, for which he received applause from all present.

Finally, as a reward to those who stayed until the end, there was a raffle for a Scheme-driven robot, which I won! (PICO-020, pictured above) Due to the overwhelming impressiveness of Esposito Louis's talk, he was awarded a second robot.

Marc Feeley promised to send me the whole toolchain. Although he hasn't yet, I found an associated paper and repo online.

4. Summary: Rekindling the flame

I noticed that several people had a similar story to my own: they'd drifted away from the Lisp world over the last few years, and going to ILC was a way to rekindle that flame. That could have been a more potent topic for the panel – CL went through a spike of interest in the mid-2000s; where did those people go, and what lessons can the community learn from that?

In any case, ILC worked for me. I came away from the conference eager to return to the CL community, to better curate the libraries and tools, and most importantly, to hack more Lisp.

Headless Testing of OpenGL Software

2014-08-11T02:30:00+0000

I've been resuscitating an old game I wrote in C; at the same time, I've been involved in a major refactoring effort for a client, which resulted in me rereading Michael Feathers' excellent book, Working Effectively with Legacy Code. Inspired by Feathers, I decided I would like to try to get 95%+ code coverage before I made any major changes to it.

I'm planning to write more about testing and games, but today I wanted to just announce one little aid to this process.

There's a fair bit of OpenGL code involving shaders that needs to be tested. A great solution to this problem is using Mesa's software renderer (OSMesa). Although it has its own bugs, it does also help to rule out driver-specific problems in the code, which is one of the big nightmares of writing OpenGL code. (Oh, and always set MESA_DEBUG in your environment when running your tests!)

The problem is that I use GLEW to deal with setting up extensions appropriately, and it doesn't play well with OSMesa. Searching the web, I see that a number of people have had this problem; indeed, chromium even have their own patched version of GLEW with OSMesa support. None of the solutions I found online worked well for me, though, so I contributed my own 80% solution (sorry Olin) which can be found on github:

https://github.com/tokenrove/glew/tree/headless-for-testing

This adds a linux-osmesa system definition that can be used to test GLEW-using code with OSMesa. The extensions used by my own code are relatively conservative so I wouldn't be surprised if more modern code does not work in this case, but hopefully this will help someone else out there.

The boustrophedonic madness of space-filling curves: ICFPC 2012 postmortem

2012-07-16T02:30:00+0000

The programming contest associated with the ICFP conference is, in my mind, the most prestigious programming competition currently running. The lack of restrictions compared to many competitions is an indication of its difficulty: anyone can enter, on teams or alone; almost any language is permissible; and the task changes several times during the competition.

Many years I have promised myself that I would compete, and many years I did at most one day. The morning of the second, the siren call of one of my own back-burner projects would wax louder, and I would wonder why I was solving someone else's problem.

This year, I resolved to endure the weekend, no matter what happened. My goal was to submit a solution, but that didn't happen, and here's my story why.

1. The problem, and my problems

The problem this year was basically Boulderdash without monsters, with a cellular automata model. Instead of diamonds, one collects lambdas. After the first announcement, I fully expected either some kind of cellular automata computation problem or the addition of monsters and other players. Unfortunately, neither expectation was correct.

In turn, the organizers introduced flooding (lower parts of the board gradually become hazardous), trampolines (effectively teleporters), beards and razors (a kind of amoeba that fills its Moore neighborhood, and a means of cutting beards), and higher-order rocks (lambdas hidden inside boulders).

I think my main problems this year were familiar ones for me in general: too much research, and overengineering. I often wonder whether having teammates might have prevented some of these problems. Maybe next year.

1.1. Too much R, not enough D

According to my org-mode files, I put in 40 hours of work this weekend, and at least 10 hours are attributed to pure research, although I know that many of the hours clocked on developing the state model were also research. I read (or at least skimmed) over 40 papers.

1.2. Overengineering

My solution involved a parent process that handled signals and executed children, restarting them if they crashed or exited before the time limit was reached (SIGINT sent from a harness), periodically reading the best solutions logged by the children. It involved Bloom filters, and representing state as a path on a space-filling curve¹ to increase cache coherency. It involved tcmalloc, and bit interleaving tricks. I was constantly engineering for the most extreme cases, and as a result, I never finished a working lifter (solver).

Once again, the agile null hypothesis stands: YAGNI, KISS, et cetera. Ignore this at your peril.

Figure 1: Madness therein lies.

2. Chronology

2.1. Friday

The announcement of the problem took me off guard, since I had expected it to begin on Friday evening, and it started at 12:00 UTC.

I wasn't sure how I was going to do search, but I decided from the beginning that any good solution was going to need to be able to efficiently compute the next state, and probably represent states compactly.

My initial implementation was in Common Lisp, using bignums as bitplanes, with the goal being to do board update as a sequence of whole-board boolean operations. Looking back on the code, nothing is particularly interesting, although the following snippet demonstrates shifting the board in a given direction:

(logand (ash (bits p) (ecase dir (left -1) (right 1) (up (- w)) (down w)))
        (ldb (byte (* w h) 0) -1))
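
For readers who don't speak Common Lisp, the same shift can be sketched in Python, whose integers are also arbitrary precision. This assumes cell (x, y) lives at bit y*w + x, which matches the offsets in the snippet but is otherwise my guess at the layout; the name shift_board is made up for illustration.

```python
def shift_board(bits, direction, w, h):
    """Shift a w-by-h bitplane one cell in the given direction.

    Assumes cell (x, y) is stored at bit y * w + x, so moving right
    raises the bit index by 1 and moving down raises it by w.
    """
    offsets = {"left": -1, "right": 1, "up": -w, "down": w}
    n = offsets[direction]
    # Python has no single signed shift like Lisp's ASH, so branch on sign.
    shifted = bits << n if n >= 0 else bits >> -n
    # Drop any bits pushed past the end of the board.
    return shifted & ((1 << (w * h)) - 1)


# A 3x2 board with a single bit set at (x=1, y=0).
board = 1 << (0 * 3 + 1)
moved_down = shift_board(board, "down", 3, 2)
```

Like the Lisp version, this only masks off bits that fall off the ends of the board; neither version stops a bit wrapping from the end of one row into the start of the next.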

I spent a lot of time thinking about admissible heuristics. A* and its variants need a function $f(n) = g(n) + h(n)$, where $g(n)$ represents the path cost (here, path benefit) to node $n$, and $h(n)$ is an admissible heuristic for the potential benefit from node $n$ onward to a goal state.²

One of the problems with the specified task is that every position is a goal state. Your score at any state is $c \cdot 25 \cdot \lambda - m$, where $\lambda$ is the number of lambdas collected, $m$ is the number of moves, and \[c = \left\{ \begin{array}{rl} 1 &\mbox{ if one is crushed or drowned} \\ 2 &\mbox{ if one aborts} \\ 3 &\mbox{ if one escapes on a lift} \end{array} \right. \]

You can abort at any time, and leave with points, therefore no smart program should ever be crushed or drowned, but you need a simulator that will actually alert you to the fatal move and back up to abort immediately upon finding the previous lambda (meaning that aborting right out of the gate is superior to making moves that don't find a lambda). The lift only works if you've collected all lambdas, and some maps may be unsolvable, therefore one cannot depend on reaching the lift as a goal state.
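
To make the scoring rule concrete, here is a small Python sketch of the formula above. The function name and the outcome labels ("dead", "abort", "lift") are mine, chosen only for illustration.

```python
def score(lambdas, moves, outcome):
    """Score = c * 25 * lambdas - moves, with c set by how the run ends."""
    multiplier = {
        "dead": 1,   # crushed or drowned
        "abort": 2,  # the robot aborts
        "lift": 3,   # the robot escapes on a lift
    }[outcome]
    return multiplier * 25 * lambdas - moves


# Aborting immediately beats wandering without finding a lambda.
abort_at_start = score(0, 0, "abort")
abort_after_wandering = score(0, 5, "abort")
```

With no lambdas collected, every move only subtracts a point, which is why aborting right out of the gate is better than moving without finding anything.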

As for the admissibility of a heuristic, in our case, since we're looking at points scored rather than cost (although, since we know how many lambdas there are, we could express cost as difference from $75\cdot\lambda_\mbox{total}$), we need a function that does not under-estimate the potential value of a position. The nice thing about this idea is that you can merge several admissible heuristics by finding their minimum.
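
The merging step can be sketched in a few lines of Python. If every heuristic is an upper bound on the points still obtainable from a state, then so is their minimum, and it is at least as tight as any one of them. The two bounds below, and the state fields they read, are placeholders I invented to show the shape of the idea; I have not checked that they are admissible for the real game.

```python
def combine(*heuristics):
    """Merge upper-bound heuristics by taking their pointwise minimum."""
    def combined(state):
        return min(h(state) for h in heuristics)
    return combined


def bound_by_lambdas(state):
    # Placeholder bound: 75 points for every lambda not yet collected.
    return 75 * state["lambdas_remaining"]


def bound_by_moves(state):
    # Placeholder bound: at most one lambda per remaining move,
    # assuming a hypothetical cap on the number of moves.
    return 75 * state["moves_left"]


h = combine(bound_by_lambdas, bound_by_moves)
estimate = h({"lambdas_remaining": 4, "moves_left": 2})
```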

This, however, presumes that you can find such a heuristic for the problem at hand. I thought about the classic Sokoban heuristic (Manhattan distance from blocks to goals)³ and other tricks, but nothing seemed very satisfying. Playing the game manually (on Stefan Bühler's awesome JavaScript simulator) demonstrated I lacked "domain" insight… how could I write a good heuristic? If you look at the original Rog-o-matic paper, they cited the use of domain knowledge from human experts as key to Rog-o-matic's success. (Aside: I had intended on creating a graphical interactive version of the simulator, with alpha blended flooding indicators and danger indicators, but once I saw Stefan's simulator I didn't even bother.)

Had I been more familiar with recent game AI research, my difficulty in finding an admissible heuristic would have tipped me off to an alternate approach which I didn't discover until mid-afternoon Sunday.

My notes show I spent most of my time thinking about state representations, and only a few hours coding. I implemented a harness in perl that behaved as the competition harness would; the interesting portion being as follows:

my $pid = open2(\*LIFTER_OUT, \*LIFTER_IN, $LIFTER) or die $!;
eval {
    my $gracious = 1;
    local $SIG{ALRM} = sub {
        if($gracious) {
            kill 'INT', $pid; $gracious = 0; alarm 10;
        } else {
            kill 'KILL', $pid; alarm 0; die "Exceeded life expectancy.\n";
        }
    };
    alarm($TIME_TO_LIVE);       # 150 in competition

    print LIFTER_IN $map; close(LIFTER_IN); while(<LIFTER_OUT>) { $route .= $_; }
    waitpid $pid, 0;
    alarm 0;
};
die $@ if($@);

So it spawns the lifter, sets an alarm of 150 seconds, feeds it the map, and then waits to read the route from it. After the first timeout, it sends SIGINT. The process gets 10 more seconds grace before SIGKILL is sent.
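
For comparison, roughly the same protocol can be sketched with Python's subprocess module. This is not a translation of the Perl above: the function name, the return value, and the echoing stand-in lifter are all assumptions, and unlike the Perl version it returns whatever output it collected instead of dying when the child has to be killed. It also assumes a POSIX system, where sending SIGINT to a child is straightforward.

```python
import signal
import subprocess
import sys


def run_lifter(command, map_text, time_to_live=150, grace=10):
    """Feed map_text to the lifter; return (route, timed_out)."""
    proc = subprocess.Popen(
        command, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True
    )
    try:
        route, _ = proc.communicate(map_text, timeout=time_to_live)
        return route, False
    except subprocess.TimeoutExpired:
        # First deadline passed: politely ask the lifter to wrap up.
        proc.send_signal(signal.SIGINT)
    try:
        route, _ = proc.communicate(timeout=grace)
    except subprocess.TimeoutExpired:
        # Grace period over: kill it and collect what it wrote.
        proc.kill()
        route, _ = proc.communicate()
    return route, True


# Stand-in lifter for demonstration: echoes the map in upper case.
echo_lifter = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]
```

Calling run_lifter(echo_lifter, "abc\n", time_to_live=30) should return the echoed text and a false timed_out flag, since the stand-in exits well before the deadline.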

2.2. Saturday

I spent the morning writing a test suite (using TAP so I could call it with prove) that took maps annotated with routes tested by hand, and compared the simulator's output to the results of the web validator. This revealed numerous discrepancies between my model of rocks and the web validator.

The organizers added trampolines, and this prompted much thought about graph structures, particularly the idea of walking the space from the robot's initial position, keeping track of the connected components of the graph and discarding anything else. I wondered if I could use some kind of complete heap storage approach so that the common case of four adjacencies would be implicit in the packed array storage, and still handle the exceptional case of trampolines (with four completely different adjacencies). At this point, I considered applying homotopic compaction⁴ to the empty space in levels to reduce graph size.

I spent a bit of time looking into monotonic paths, bumping into the Catalan numbers a few times on the way, but to no useful end. (Note that a space-filling curve on a fixed grid as in our case is a kind of self-avoiding walk.) Lots of interesting mathematics, but nothing that was getting the code written any faster.

I also did a lot of fruitless research into Binary Decision Diagrams (described in ⁵ for example). There's an attractive A* variant called SetA*⁶ based on using BDDs that seemed like a way to prevent the massive state explosion I expected for this problem, but I just don't understand BDDs well enough yet to implement something like that in a weekend (definitely a project for the future, though). (Another A*-alike I discarded was D*-lite, which is also pretty cool.)

That afternoon, I wrote a new simulator in J, and though it wasn't the most productive use of my time, it was the most fun I had all weekend. J really is a delightful language. I just wish there was a good compiler for it.

Here's all the rock update code (whole map at once), for example:

updaterocks =: 3 : 0
NB. XXX should probably calculate rocks once but i like these trains so much...
  a =. (rocks *. (above @: empty)) y
  b =. (rocks *. (above @: rocks) *. (leftAndUpLeft @: empty)) y
  c =. (rocks *. (above @: rocks) *. (rightAndUpRight @: empty)) y
  d =. (rocks *. (above @: lambdas) *. (leftAndUpLeft @: empty)) y
  r =. rocks y
  (r *. (-. (a +. b +. c +. d))) +. (below r *. a) +. (right @: below r *. (b+.d)) +. (left @: below r *. c)
)

I figured I wasn't going to write a lifter in J, though, since the data structures and recursion in an A* search like SMA* are hard (for me) to reason about in J's "everything is an array" model. There is an example of A* search in J on the J software wiki looking fairly directly translated from AIMA². Any time I see the "explicit verb only" control structures like while., it's a red flag that we're straying outside J's domain.

(In fact, the simulator is the first code in which I've ever used a multi-statement if. / elseif. in J, and so I was bitten by the bizarre misfeature that else. cannot be combined with elseif. in J.)

But it sure was fun to hack on that stuff in J, especially with the array display from the REPL just doing the Right Thing as I played with it interactively. The tessellation operator in J is amazingly expressive, too.

My fun ended with the introduction of beards into the task. Beards grow into their Moore neighborhood every so many turns. Since there was no position independent rule to determine whether a beard or a rock had priority, I was forced to write a simulator that updated the board left-to-right, bottom-to-top, instead of all-at-once (which is, to me, much more elegant, and easily parallelizable). With that, I gave up on ideas like lazily streaming states in a local area around the robot to partially evaluate their merit.
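
Taken in isolation, beard growth is easy to write as a whole-board step, as in the Python sketch below. It assumes the board is a list of equal-length strings, that beards are 'W' and empty cells are ' ', and that beards only spread into empty cells; those details are my assumptions for illustration. It deliberately ignores the interaction with falling rocks, which is exactly the part that forced the ordered update.

```python
def grow_beards(rows, beard="W", empty=" "):
    """Grow every beard into the empty cells of its Moore neighborhood.

    Reads only the old board, so beards created in this step do not
    themselves grow until the next growth step.
    """
    height, width = len(rows), len(rows[0])
    new_rows = [list(row) for row in rows]
    for y in range(height):
        for x in range(width):
            if rows[y][x] != beard:
                continue
            # Visit the 8 neighbors (and the cell itself, harmlessly).
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and rows[ny][nx] == empty:
                        new_rows[ny][nx] = beard
    return ["".join(row) for row in new_rows]


grown = grow_beards(["     ", "  W  ", "     "])
```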

This was the nadir for me, where I realized I was basically back at the beginning, having read dozens of papers but being no further along for it. I still didn't have a solver implemented, and I wasn't confident that a basic A* approach would even work. I had no good heuristics, especially with the introduction of beards, which must be shaved with razors which the robot can collect.

2.3. Sunday

I did some toying around in J, Lisp, and ATS, but nothing useful came out of it. The final task update came in: "higher-order rocks", which are rocks that contain lambdas which only become available if the rock is broken open by dropping it. This inspired even more research and less coding.

I forget how I stumbled across it, but amidst all these papers, I came across A Survey of Monte-Carlo Tree Search Methods by Browne et al., and it blew me away. Here was the perfect strategy for this problem. Indeed, a bit more searching led me to Schadd et al.'s paper on applying MCTS to single-player puzzles⁷ which described exactly my predicament. In the absence of a good heuristic function, here was a way to search, balancing exploitation and exploration as necessary for the problem. I could even use my parent-child model of processes to implement metasearch by blowing away the child every so often if it wasn't reporting new routes back to the parent often enough.
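
One common way to strike that balance is the UCB1 rule used by the UCT flavour of MCTS, sketched here in Python. I never got far enough to settle on a selection policy, so treat this as a generic illustration and not as what my lifter did; the function names and the dictionary layout are invented for the example, and the rule is usually applied to rewards normalized to the range 0 to 1.

```python
import math


def ucb1(total_reward, visits, parent_visits, exploration=math.sqrt(2)):
    """UCB1 value: average reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    average = total_reward / visits
    bonus = exploration * math.sqrt(math.log(parent_visits) / visits)
    return average + bonus


def select_child(children, parent_visits):
    """Pick the move with the highest UCB1 value.

    children maps each move to a (total_reward, visits) pair.
    """
    return max(children, key=lambda move: ucb1(*children[move], parent_visits))


# A well-explored strong move wins over a barely tried weak one...
exploit = select_child({"L": (30.0, 10), "R": (3.0, 2)}, 12)
# ...but a rarely tried move wins when the averages are close.
explore = select_child({"L": (10.0, 10), "R": (3.0, 2)}, 12)
```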

I took to the whiteboard (erasing flocks of Z- and U-shaped squiggles), and scribbled out a grand plan: a giant hash table would store states, resolving conflicts by choosing the state with the highest score (thus eliminating duplicate states achieved by different paths); a simulator would accept states linearized by Morton's Z-order curve and emit the next state (writing to a pair of buffers swapped each iteration), points, robot condition, and subsequent valid moves; and Monte Carlo Tree Search would build a tree of routes by choosing random moves from a list of valid moves weighted by any heuristics we subsequently developed.
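Two pieces of that plan can be sketched briefly, assuming a map stored as a list of equal-length strings: linearizing cells along the Z-order curve by interleaving coordinate bits, and a state table that keeps only the best-scoring route to each state. The function names and the `(score, route)` layout are hypothetical, and a Python dict stands in for the giant hash table.

```python
def morton_encode(x, y, bits=16):
    """Interleave the bits of x and y (x in the even bit positions)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code

def linearize(grid):
    """Flatten a grid (list of equal-length strings) in Z-order."""
    cells = [(morton_encode(x, y), ch)
             for y, row in enumerate(grid)
             for x, ch in enumerate(row)]
    return "".join(ch for _, ch in sorted(cells))

def remember(table, state, score, route):
    """Keep only the highest-scoring route to each state.

    Returns True if the table was updated, False if an equal or better
    entry for the same state was already stored.
    """
    best = table.get(state)
    if best is None or score > best[0]:
        table[state] = (score, route)
        return True
    return False
```

With this layout, two different routes that reach the same board collapse into one table entry, which is the duplicate elimination described above.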

An example of the heuristic move weighting was to give a small probability bump to the "down" move in early turns on levels with flooding, to try to explore the bottom before it became completely flooded.
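A minimal sketch of that kind of weighting, assuming moves are single letters with "D" for down; the threshold `early_turns` and the size of `down_bonus` are made-up values for illustration, not tuned numbers from the entry.

```python
import random

def move_weights(valid_moves, turn, flooding, early_turns=20, down_bonus=0.5):
    """Uniform weights, with a small bump for "D" early on flooding levels."""
    weights = []
    for move in valid_moves:
        weight = 1.0
        if move == "D" and flooding and turn < early_turns:
            weight += down_bonus
        weights.append(weight)
    return weights

def choose_move(valid_moves, turn, flooding, rng=random):
    """Draw one move at random according to the heuristic weights."""
    weights = move_weights(valid_moves, turn, flooding)
    return rng.choices(valid_moves, weights=weights, k=1)[0]
```

Because the heuristic only nudges probabilities, every valid move remains reachable; further heuristics would just be extra adjustments to `weight`.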

Another example of the flexibility of this method, which Craig came up with as I was giving him a post-mortem on the competition, is the idea of analyzing the distribution of lambdas when the map is first read, and using it to tune the balance between exploitation and exploration: explore more the farther apart the lambdas are, on average.
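This idea was never implemented, so the following is only one possible reading of it: measure the mean pairwise Manhattan distance between lambdas, normalize by the map size, and use the result to scale an exploration constant. The lambda character `\`, the normalization, and the `base` and `scale` parameters are all assumptions.

```python
from itertools import combinations

def lambda_positions(grid, lambda_char="\\"):
    """Coordinates of every lambda cell in a grid given as a list of strings."""
    return [(x, y)
            for y, row in enumerate(grid)
            for x, ch in enumerate(row)
            if ch == lambda_char]

def mean_pairwise_distance(points):
    """Average Manhattan distance over all pairs; 0.0 with fewer than two points."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    total = sum(abs(ax - bx) + abs(ay - by) for (ax, ay), (bx, by) in pairs)
    return total / len(pairs)

def exploration_constant(grid, base=1.0, scale=1.0):
    """Grow the exploration constant as lambdas get more spread out."""
    height = len(grid)
    width = max(len(row) for row in grid)
    spread = mean_pairwise_distance(lambda_positions(grid)) / (width + height)
    return base + scale * spread
```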

Of course, the tragic ending is that my approach was criminally overengineered. I got a quarter of the way into my hyperoptimized C implementation, shaving every byte, and realized I had no way to finish it in the time remaining. So I went to sleep.

The MCTS idea is cool enough on its own that I am going to try and complete it in one form or another, but it's a shame about the competition. Maybe next year. I certainly learned a lot this time around, although some of the lessons should have been absorbed before now. I blame my obsession with space-filling curves.

Footnotes:

¹

Worst-case locality of a curve: \[\frac{d(p,q)^2}{A(p,q)}\] with $d(p,q)$ the distance between points $p$ and $q$, and $A(p,q)$ the area filled by the curve between $p$ and $q$.

²

See Russell and Norvig's Artificial Intelligence: a Modern Approach.

³

Junghanns and Schaeffer. "Sokoban: Enhancing general single-agent search methods using domain knowledge." In Artificial Intelligence 129, 2001.

⁴

F.S. Al-Anzi, "Efficient Cellular Automata Algorithms for Planar Graph and VLSI Layout Homotopic Compaction."

⁵

D. E. Knuth, The Art of Computer Programming, Volume 4, Fascicle 1: Bitwise Tricks & Techniques; Binary Decision Diagrams, 1st ed. Addison-Wesley Professional, Mar. 2009. Available: http://www.worldcat.org/isbn/0321580508

⁶

R.M. Jensen, R.E. Bryant and M.M. Veloso, "SetA*: An efficient BDD-Based Heuristic Search Algorithm". In Proceedings of 18th National Conference on Artificial Intelligence (AAAI'02), pages 668-673, 2002.

⁷

M.P.D. Schadd, M.H.M. Winands, H.J. van den Herik, G.M.J-B. Chaslot and J.W.H.M. Uiterwijk. Single-Player Monte-Carlo Tree Search. In Computers and Games, H.J. van den Herik, X. Xu, Z. Ma and M.H.M. Winands, eds., Springer, pages 1-12, Beijing, China, 2008.

A static blog compiler in emacs

2012-06-29T02:30:00+0000

Back in the mid-to-late '90s, I had a hideous homepage on geocities or something similar; in dark blue text on a black background, serving no purpose, as was the style of the time (but at least it was Lynx friendly!). Anyway, at the time, it seemed logical to me that one should statically compile such sites, using templates to insert uniform headers and footers. So, I implemented my own with m4 and make; it might have even had a link-checking pass, I can't recall. I wrote way too much m4 in those days.

Anyway, time passed, things on the web came and went. The era of "blogs" and "CMSes" came, and with it came crippled browser-based administration of said sites. I wanted no part, and continued to lead the life of the hermit.

Around the same time that I decided to become less private in my life, a static blog generator called jekyll came into vogue. It seemed to me that things were coming back in the right direction, and I gave it a shot. It worked okay for trivial things, so I used it for a few different sites. I won't get into my misgivings about Ruby; that's material for a later post on language hipsterism.

So I used jekyll for a while without too many hassles, until I started a post (forthcoming) on suffix arrays. I needed to embed some math, so I tried to use MathJax. Getting markdown to play well with MathJax wasn't working, so I converted the post to textile. RedCloth barfed on the first non-ASCII character in the post, so I took a look at the source and thought long and hard about whether I was going to fix this serious bug just so I could shoehorn my use case into some rinkydink markup language.

I wanted to write my posts with org-mode, which has sensible LaTeX-style math input, tables, and syntax highlighting that plays well with emacs. Org has its own publishing features, but I wasn't going to let that stop me from reinventing the wheel for the nth time, alas. So I wrote a quick Jekyll-replacement that runs inside emacs and uses org-mode as the post format.

I replaced the liquid templating with a simpler approach: the contents of each template tag are read and evaluated as a Lisp expression. So, there's no interleaving of template tags and content, which hasn't been a problem for me yet.
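The same idea, transposed to Python purely for illustration (the actual tool is Emacs Lisp and its tag syntax is not shown here): find each tag, evaluate its contents as an expression, and splice the result into the page. The `{{ ... }}` delimiters and the `render` function are assumptions.

```python
import re

# Hypothetical tag syntax: {{ expression }}
TAG = re.compile(r"\{\{(.*?)\}\}", re.DOTALL)

def render(template, env):
    """Replace each tag with the string value of its evaluated contents.

    env maps names to values visible inside the expressions. Emptying
    __builtins__ keeps the example small; it is not a security sandbox.
    """
    def expand(match):
        expression = match.group(1).strip()
        return str(eval(expression, {"__builtins__": {}}, dict(env)))
    return TAG.sub(expand, template)
```

For example, `render("<h1>{{ title.upper() }}</h1>", {"title": "hi"})` gives `<h1>HI</h1>`. Since a tag is just an expression, there are no block constructs that wrap page content.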

Though it's a quick hack, I'm happy with it – it scratches my itch, and I'm sure to improve it as time goes on. Maybe it can turn into something useful for other people eventually, too.

Given those caveats, feel free to download the source code.

Shred for Satan initial release

2012-04-02T02:30:00+0000

Here's a little GTK-based metronome I wrote for Kyla to practice the Molt material. It was created because the material contains a lot of meter and tempo changes, which are hard to practice with a conventional metronome. So this one reads key, meter, and tempo changes from a MIDI file. Since we prepare all our scores with Lilypond, we have MIDI files that include all the requisite information.
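As a rough sketch of why a tempo map is useful (in Python rather than the OCaml the program is written in, and not taken from its source): MIDI expresses tempo as microseconds per quarter note, so click times can be accumulated beat by beat as the tempo changes. This assumes quarter-note beats and tempo changes that land exactly on a beat; `click_times` is a hypothetical name.

```python
def click_times(tempo_changes, total_beats):
    """Return the time in seconds of each click.

    tempo_changes: list of (beat_index, microseconds_per_quarter) pairs,
    sorted by beat_index, with the first entry at beat 0.
    """
    times = []
    now = 0.0
    next_change = 0
    us_per_beat = tempo_changes[0][1]
    for beat in range(total_beats):
        # Apply every tempo change scheduled at or before this beat.
        while (next_change < len(tempo_changes)
               and tempo_changes[next_change][0] <= beat):
            us_per_beat = tempo_changes[next_change][1]
            next_change += 1
        times.append(now)
        now += us_per_beat / 1_000_000
    return times
```

For instance, two beats at 120 bpm (500000 µs per quarter) followed by two at 60 bpm give clicks at 0.0, 0.5, 1.0 and 2.0 seconds.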

It needs ocaml, lablgtk, ocaml-bitstring, and a recent version of ocaml-portaudio (one that uses portaudio v19) to build.

Despite the fact that it uses portaudio, I've only tested it with JACK running. There's something broken in portaudio's interaction with JACK, so you might find that it won't obtain the correct ALSA device if JACK isn't running.

A kernel driver for legacy Wacom serial tablets

2011-07-02T02:30:00+0000

Update (14/08/20): It looks like this driver will be included in Linux kernel 3.17, thanks to the labors of Hans de Goede. It should no longer be necessary to use the version linked here.

Update (12/03/27): There is a new version of this driver available, which includes a patched version of inputattach, here. This includes support for PenPartner tablets. For Intuos tablets, look here.

Having gotten back to doing some art on computer, I decided to dust off my old Wacom Digitizer II again. It's always a bit of an adventure trying to get it to work on a new system, as some configuration system has always completely changed since the last time I hooked it up. However, this time, I discovered that while the general approach to detecting and configuring input devices had improved a lot, support for these old serial Wacom tablets had been completely removed from the xorg Wacom input driver!

Initially I was pretty irritated, as you can imagine, but after looking at the code that had been excised, it was clear that this was for the best. Given the new(ish) approach to handling input devices in the Linux kernel, having all the support for the device on the X side is now clearly the Wrong Thing. So, I set about reading as much code as possible related to serial Wacom tablets, and writing a serio-based driver.

Along the way, it seemed to me that this would be cleaner if protocol four (like my Digitizer II) and protocol five (newer tablets like the Intuos series) devices were supported separately. So, Intuos owners, I regret to say that the driver presented here does not support your devices, though I wouldn't mind trying to write a driver to support them.

Aside from the inevitable actual bugs to be discovered, this driver currently does not support (at least):

pad buttons;
tilt;
suppression;
cursor devices (some things are missing to fully support these devices).

To use it presently, you'll need to do a few things (these instructions apply to Debian systems but should be easily adapted elsewhere):

Unpack and build the module:
```
$ tar xzf wacom_serial.tar.gz
$ cd wacom_serial
$ make all
```
That should produce wacom_serial.ko if you've got things otherwise configured correctly for building modules against your current kernel version. Then:
```
$ sudo insmod ./wacom_serial.ko
```

Patch and build inputattach (in the joystick package) with the included patch:

```
$ apt-get source joystick
$ cd joystick-1.4.1
$ patch -p1 < ~/wacom_serial/inputattach.patch
$ dpkg-buildpackage
$ sudo dpkg -i ../inputattach_1.4.1-1_powerpc.deb
```

(Adjust paths to things per your case, of course.)

Add the included 70-serial-wacom.rules file to your local udev rules (put it in /etc/udev/rules.d).
Connect your tablet, turn it on, and run:
```
$ sudo inputattach --wacom_iv /dev/ttyS0
```
where ttyS0 is the device for the serial port to which the tablet is attached. USB serial adapters usually show up as /dev/ttyUSBn.

At this point, if everything else on your system is fairly current (including the xf86-input-wacom module and its configuration), your tablet should hopefully work in X. Let me know.

So far, I've only tested it on Linux kernel 2.6.39, i386 and powerpc.

You can get the driver here: wacom_serial-110702-0.tar.gz. If you have a Wacom serial tablet, please try it out and let me know what happens, whether it succeeds or fails. Please also send any messages logged (usually to /var/log/kern.log) from the point where you attached the device with inputattach.

This driver was developed with reference to much code written by others, particularly:

elo, gunze drivers by Vojtech Pavlik;
wacom_w8001 driver by Jaya Kumar;
the USB wacom input driver, credited to many people (see drivers/input/tablet/wacom.h);
new and old versions of linuxwacom / xf86-input-wacom credited to Frederic Lepied, Ping Cheng, and John E. Joganic;
and xf86wacom.c (a presumably ancient version of the linuxwacom code), by Frederic Lepied and Raph Levien.

Molt live, July 21st

2011-06-16T02:30:00+0000

My eccentric death metal band, Molt, will be playing Barfly on July 21st. Further details to come soon — keep an eye on the feed at molt.ca or the event page on last.fm.

Anaphora 0.9.4 released

2011-06-15T02:30:00+0000

Just shy of the fifth anniversary of the last release, anaphora 0.9.4 has been released. This release is mostly some accumulated minor bug fixes, though it also adds ALET and SLET.

Anaphora is an anaphoric macro package for Common Lisp, allowing code like this:

(define-binary-type array (type count)
  (:reader (in)
    (aprog1 (make-array count :element-type type)
      (loop for i below count do (setf (svref it i) (read-value type in)))))
  (:writer (out array)
    (loop for v across array do (write-value type out v))))

(defun get-faces (chunk)
  (awhen (recursively-seek-chunk 'face-list chunk)
    (faces-of it)))

(APROG1, IT, and AWHEN are symbols from ANAPHORA).