When is an Erlang iolist an iovec? [erlang, unix, c]

While trying to improve the performance of JSON encoding in an Erlang application last year, I came to wonder about the different representations one can use when writing data to disk or sending it over a socket, and how they map to the OS's underlying facilities.

In Erlang, there is a convention of deferring the creation of large binaries, whereby functions accept a list called an iolist, composed of binaries, characters (integers), and other, nested, iolists. This means that you don't need to waste time and space copying a bunch of pieces of a message into a single buffer before writing it out to a socket, for example.

If you've done much network programming, that concept probably sounds familiar; it's like a less-restrictive version of the struct iovec scatter-gather structures used in many Unix IO calls (see the glibc documentation for example).

Since there's an obvious connection, I asked myself the titular question for this post: when does an Erlang iolist map most closely to an iovec? Is there a layout that minimizes copying and allocation?

tl;dr: Erlang does a pretty good job of this, so don't worry about it and just use iolists. A list of large refcounted binaries maps closely to an iovec with minimal extra copying. Also, some drivers don't even support iolists!

1. Establishing the relationship

I had previously noticed the SysIOVec structure in the ERTS source, which erts/emulator/sys/unix/driver_int.h defines as:

typedef struct iovec SysIOVec;

(As I was fact-checking this article, I noticed that a chapter of the ERTS reference manual is explicit in making this connection. People often complain about the Erlang documentation, but part of the problem is just that you might never think to read a chapter called "How to implement an alternative carrier for the Erlang distribution".)

ErlIOVec, which contains a SysIOVec, is defined like this:

typedef struct erl_io_vec {
    int vsize;                  /* length of vectors */
    ErlDrvSizeT size;           /* total size in bytes */
    SysIOVec* iov;
    ErlDrvBinary** binv;
} ErlIOVec;

There's an erlang-questions mailing list post by Scott Fritchie that points to the functions io_list_to_vec() and io_list_vec_len(). Let's take a look at them.

2. Reading the source

Here's the beginning of io_list_vec_len: (in erts/emulator/beam/io.c)

/* 
 * Returns 0 if successful and a non-zero value otherwise.
 *
 * Return values through pointers:
 *    *vsize      - SysIOVec size needed for a writev
 *    *csize      - Number of bytes not in binary (in the common binary)
 *    *pvsize     - SysIOVec size needed if packing small binaries
 *    *pcsize     - Number of bytes in the common binary if packing
 *    *total_size - Total size of iolist in bytes
 */

static int 
io_list_vec_len(Eterm obj, int* vsize, Uint* csize,
                Uint* pvsize, Uint* pcsize,
                ErlDrvSizeT* total_size)
{

This is called by erts_port_output, and a common binary gets allocated based on the csize returned. In our thought experiment here, we want to avoid allocating this common binary, and we would like to make sure vsize is minimized and that we can efficiently pack the iovec without allocation and copying.

Note that we figure out both packed and unpacked sizes. Basically, if the unpacked vsize is small enough, we won't pack, since that means less work (as we'll see below).

It's worth noting here that copying things into a common binary, although it will cause an allocation, can be much faster than forcing the OS to work with a long iovec.

io_list_vec_len() continues:

    DECLARE_ESTACK(s);
    Eterm* objp;
    Uint v_size = 0;
    Uint c_size = 0;
    Uint b_size = 0;
    Uint in_clist = 0;
    Uint p_v_size = 0;
    Uint p_c_size = 0;
    Uint p_in_clist = 0;
    Uint total; /* Uint due to halfword emulator */

erts/emulator/beam/global.h says of DECLARE_ESTACK: "Here is an implementation of a lightweiht stack." (sic) It basically lays out a little array of terms, initially on the stack (up to 16 items), which can be migrated to the heap if the stack grows too much.

(Reading the ERTS allocator code is an easy way to spend a lot of time; it's pretty convoluted, and figuring out exactly which allocator ends up doing what is non-obvious on one's first few encounters with that code.)

    goto L_jump_start;  /* avoid a push */

    while (!ESTACK_ISEMPTY(s)) {
        obj = ESTACK_POP(s);
    L_jump_start:

We begin by jumping into the loop. Note throughout this code the careful use of goto to avoid unnecessary churn on the stack.

        if (is_list(obj)) {
        L_iter_list:
            objp = list_val(obj);
            obj = CAR(objp);

If obj is a cons cell, we inspect the head.

            if (is_byte(obj)) {
                c_size++;
                if (c_size == 0) {
                    goto L_overflow_error;
                }
                if (!in_clist) {
                    in_clist = 1;
                    v_size++;
                }
                p_c_size++;
                if (!p_in_clist) {
                    p_in_clist = 1;
                    p_v_size++;
                }
            }

If it's a byte (specifically, a fixnum under 256), we add to the size of the common binary required. If we weren't already in a section that will point into the common binary, we have to add to the size of the underlying iovec. Note the behavior is the same for packed and unpacked.

So our first rule is probably going to be "avoid interspersing binaries and strings". This will keep vsize lower.

            else if (is_binary(obj)) {
                IO_LIST_VEC_COUNT(obj);
            }

Back to our loop, still handling the elements of a list, if we got a binary instead, we invoke IO_LIST_VEC_COUNT.

This is a monster macro in the same file. Let's take a look in pieces:

#define IO_LIST_VEC_COUNT(obj)                                          \
do {                                                                    \
    Uint _size = binary_size(obj);                                      \
    Eterm _real;                                                        \
    ERTS_DECLARE_DUMMY(Uint _offset);                                   \
    int _bitoffs;                                                       \
    int _bitsize;                                                       \
    ERTS_GET_REAL_BIN(obj, _real, _offset, _bitoffs, _bitsize);         \
    if (_bitsize != 0) goto L_type_error;                               \

We start by getting a bunch of properties about the binary, and erroring out if this is a bitstring. In Erlang, a bitstring is a binary with a number of bits which isn't a multiple of 8. Note that we don't error out if this binary has a non-octet bit offset.

    if (thing_subtag(*binary_val(_real)) == REFC_BINARY_SUBTAG &&       \
        _bitoffs == 0) {                                                \
        b_size += _size;                                                \
        if (b_size < _size) goto L_overflow_error;                      \
        in_clist = 0;                                                   \
        v_size++;                                                       \

If this is a byte-aligned refcounted binary, we could put it a reference right in the iovec. This is the main answer to this blog post's question.

        /* If iov_len is smaller then Uint we split the binary into*/   \
        /* multiple smaller (2GB) elements in the iolist.*/             \
        v_size += _size / MAX_SYSIOVEC_IOVLEN;                          \

The commit for this code, added in 2016, notes:

On windows the max size of an iov element is long, i.e. 4GB so in order to write larger binaries to file we split the binary into smaller 2GB chunks so that the write is possible.

I bet that was inspired by a fun bug.

        if (_size >= ERL_SMALL_IO_BIN_LIMIT) {                          \
            p_in_clist = 0;                                             \
            p_v_size++;                                                 \
        } else {                                                        \
            p_c_size += _size;                                          \
            if (!p_in_clist) {                                          \
                p_in_clist = 1;                                         \
                p_v_size++;                                             \
            }                                                           \
        }                                                               \

This code applies only to our packed counts. ERL_SMALL_IO_BIN_LIMIT is 4*ERL_ONHEAP_BIN_LIMIT, which is 4×64 = 256. So if the binary is less than 256 bytes, and we need to pack, we'll copy it into the common binary instead of referencing it directly.

    } else {                                                            \
        c_size += _size;                                                \
        if (c_size < _size) goto L_overflow_error;                      \
        if (!in_clist) {                                                \
            in_clist = 1;                                               \
            v_size++;                                                   \
        }                                                               \
        p_c_size += _size;                                              \
        if (!p_in_clist) {                                              \
            p_in_clist = 1;                                             \
            p_v_size++;                                                 \
        }                                                               \
    }                                                                   \
} while (0)

Otherwise, if this is a heap binary, we always copy it into the common binary.

Back in io_list_vec_len():

            else if (is_list(obj)) {
                ESTACK_PUSH(s, CDR(objp));
                goto L_iter_list;   /* on head */
            }

If we got a list, we push it on the stack to deal with later.

            else if (!is_nil(obj)) {
                goto L_type_error;
            }

Finally, if there's anything else that isn't the empty list, that's an error.

            obj = CDR(objp);
            if (is_list(obj))
                goto L_iter_list;   /* on tail */
            else if (is_binary(obj)) {  /* binary tail is OK */
                IO_LIST_VEC_COUNT(obj);
            }
            else if (!is_nil(obj)) {
                goto L_type_error;
            }

Here we handle the tail of the list. It's interesting that an iolist doesn't need to be a proper list (you can have [List | <<"bin">>]).

So, our conclusions so far would seem to be that an Erlang iolist maps most closely to an iovec when it is an arbitrarily nested list of refcounted binaries – and if there are more than 16 of them, they must be at least 256 bytes long each.

Since the stack for processing these lists changes its allocation strategy around 16 items deep, in practice you probably don't want your iolists to be too deeply nested.

Let's look at io_list_to_vec:

static int
io_list_to_vec(Eterm obj,       /* io-list */
               SysIOVec* iov,   /* io vector */
               ErlDrvBinary** binv, /* binary reference vector */
               ErlDrvBinary* cbin, /* binary to store characters */
               ErlDrvSizeT bin_limit)   /* small binaries limit */

The comments here are pretty clear; obj is the input and iov is the output. Whether to pack or not is expressed by bin_limit. As we look at the rest of the code, we'll see how the other arguments are used. The function begins with these declarations:

{
    DECLARE_ESTACK(s);
    Eterm* objp;
    char *buf  = cbin->orig_bytes;
    Uint len = cbin->orig_size;
    Uint csize  = 0;
    int vlen   = 0;
    char* cptr = buf;

Note that we setup buf, len, and cptr according to cbin. Their use will become clear as we go on.

We continue with a jump into the body of our loop:

    goto L_jump_start;  /* avoid push */

    while (!ESTACK_ISEMPTY(s)) {
        obj = ESTACK_POP(s);
    L_jump_start:

We can see that the structure of this function mirrors that of io_list_vec_len().

        if (is_list(obj)) {
        L_iter_list:
            objp = list_val(obj);
            obj = CAR(objp);

If our term is a cons, we look at the car.

            if (is_byte(obj)) {
                if (len == 0)
                    goto L_overflow;
                *buf++ = unsigned_val(obj);
                csize++;
                len--;

If it's a byte, we store that byte in the buffer passed in as cbin.

            } else if (is_binary(obj)) {
                ESTACK_PUSH(s, CDR(objp));
                goto handle_binary;

If it's a binary, we push the tail of the list we're working on onto the stack so we pick up there later, and jump to handle_binary.

            } else if (is_list(obj)) {
                ESTACK_PUSH(s, CDR(objp));
                goto L_iter_list;    /* on head */
            } else if (!is_nil(obj)) {
                goto L_type_error;
            }

If it's a list, save our current position in the outer list to the stack, and start walking the inner list.

            obj = CDR(objp);
            if (is_list(obj))
                goto L_iter_list; /* on tail */
            else if (is_binary(obj)) {
                goto handle_binary;
            } else if (!is_nil(obj)) {
                goto L_type_error;
            }

How the cdr of the cons is handled should be unsurprising by now.

        } else if (is_binary(obj)) {
            Eterm real_bin;
            Uint offset;
            Eterm* bptr;
            ErlDrvSizeT size;
            int bitoffs;
            int bitsize;

        handle_binary:
            size = binary_size(obj);
            ERTS_GET_REAL_BIN(obj, real_bin, offset, bitoffs, bitsize);
            ASSERT(bitsize == 0);

The binary handling case isn't too different from IO_LIST_VEC_COUNT above. We die immediately if this is a bitstring.

            bptr = binary_val(real_bin);
            if (*bptr == HEADER_PROC_BIN) {
                ProcBin* pb = (ProcBin *) bptr;
                if (bitoffs != 0) {
                    if (len < size) {
                        goto L_overflow;
                    }
                    erts_copy_bits(pb->bytes+offset, bitoffs, 1,
                                   (byte *) buf, 0, 1, size*8);
                    csize += size;
                    buf += size;
                    len -= size;

It's interesting that the bit offset is handled here, even though extra bits aren't permitted. The git history associated with this block is not helpful.

                } else if (bin_limit && size < bin_limit) {
                    if (len < size) {
                        goto L_overflow;
                    }
                    sys_memcpy(buf, pb->bytes+offset, size);
                    csize += size;
                    buf += size;
                    len -= size;

If we're packing small binaries and this one is below the limit, copy it in.

                } else {
                    if (csize != 0) {
                        io_list_to_vec_set_vec(&iov, &binv, cbin,
                                               cptr, csize, &vlen);
                        cptr = buf;
                        csize = 0;
                    }
                    if (pb->flags) {
                        erts_emasculate_writable_binary(pb);
                    }
                    io_list_to_vec_set_vec(
                        &iov, &binv, Binary2ErlDrvBinary(pb->val),
                        pb->bytes+offset, size, &vlen);
                }

Otherwise, we make the direct translation to an iovec entry. If we need to emit a reference to part of the common binary, do that first.

Note the curiously-named erts_emasculate_writable_binary() which seems to shrinkwrap the binary (reallocate it to trim unused space).

            } else {
                ErlHeapBin* hb = (ErlHeapBin *) bptr;
                if (len < size) {
                    goto L_overflow;
                }
                copy_binary_to_buffer(buf, 0,
                                      ((byte *) hb->data)+offset, bitoffs,
                                      8*size);
                csize += size;
                buf += size;
                len -= size;
            }
        } else if (!is_nil(obj)) {
            goto L_type_error;
        }
    }

If this is a heap binary, it's much less interesting. We just append to the common binary.

    if (csize != 0) {
        io_list_to_vec_set_vec(&iov, &binv, cbin, cptr, csize, &vlen);
    }

After all that, we reference the tail of the common binary, if we have anything left in it.

Finally we come to the end of the function:

    DESTROY_ESTACK(s);
    return vlen;

 L_type_error:
    DESTROY_ESTACK(s);
    return -2;

 L_overflow:
    DESTROY_ESTACK(s);
    return -1;
}

Not much to see here, but we might as well include it as we've looked at every other part.

This isn't complete without noting how this pair of functions is called by functions like erts_port_output(). I won't get into the guts, but if you look in the same file, you'll see that an iovec of SMALL_WRITE_VEC (16) is setup on the stack, and if io_list_vec_len() reports a vsize larger than that, it has to allocate the space using the ERTS_ALC_T_TMP allocator.

Even if csize is zero, a binary gets allocated anyway (driver_alloc_binary(0) returns a valid allocation). That seems like a waste but I guess it makes subsequent logic simpler.

3. Confirming what we've found

When I decided to write a test to verify this, I had some problems. First, I thought writing to /dev/null with file:write/2 would be the easiest way to isolate the differences. After some puzzling initial results and investigation with strace, I discovered that file:write/2 always converts the iolist to a binary before sending it to the file driver! (I imagine this is to avoid copying long strings.)

Then I tried sending data to cat >/dev/null opened as a port. At least now strace confirmed that writev was being called, but everything was being packed… a quick trip into gdb revealed drv->outputv wasn't set — this driver doesn't think it supports iovecs!

Ok, so a quick grep in erts/drivers/ reveals a lot of drivers don't define outputv. The only thing I could confirm would definitely use iovecs all the way was the TCP driver. An informal test, with ncat -k -l >/dev/null on the other side, confirmed the differences, and showed the case where the iovec structure is preserved rather than packed as being three times slower than packing cases.

So, although this is interesting, being careful about the iolist structure generated isn't likely to get me any big wins for JSON encoding.

The iolist-iovec correspondance is probably most useful when you need to send really large iolists, ones that might be too large to allocate in one place.

JS