dark theme

Why you shouldn't concatenate strings in erlang

In the previous article, I said that concatenating strings was an O(n) operation on the length of the first string.

I feel the need to explain that.

In erlang, strings are lists of integer. Since they are integers, they can store any unicode characters such as the snowman ☃ which has a value of 9731.

When one wants to concatenate 2 strings, one can use string:concat/2 implemented in lib/stdlib/src/string.erl as follows:

(All code extracts in this article are subject to the Erlang Public License.)

1
2
3
4
5
6
7
8
9
%% concat(String1, String2)
%%  Concatenate 2 strings.

-spec concat(String1, String2) -> String3 when
      String1 :: string(),
      String2 :: string(),
      String3 :: string().

concat(S1, S2) -> S1 ++ S2.

Ok, strings are lists, just concatenate lists!

Down the rabbit hole

The ++ operator is a BIF, a Built-In Function implemented in C in the Beam VM. It is implemented in erts/emulator/beam/erl_bif_lists.c.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
/*
 * erlang:'++'/2
 */

Eterm
ebif_plusplus_2(BIF_ALIST_2)
{
    return append(BIF_P, BIF_ARG_1, BIF_ARG_2);
}

static BIF_RETTYPE append(Process* p, Eterm A, Eterm B)
{
    Eterm list;
    Eterm copy;
    Eterm last;
    size_t need;
    Eterm* hp;
    int i;

    if ((i = list_length(A)) < 0) {
        BIF_ERROR(p, BADARG);
    }
    if (i == 0) {
        BIF_RET(B);
    } else if (is_nil(B)) {
        BIF_RET(A);
    }

    need = 2*i;
    hp = HAlloc(p, need);
    list = A;
    copy = last = CONS(hp, CAR(list_val(list)), make_list(hp+2));
    list = CDR(list_val(list));
    hp += 2;
    i--;
    while(i--) {
        Eterm* listp = list_val(list);
        last = CONS(hp, CAR(listp), make_list(hp+2));
        list = CDR(listp);
        hp += 2;
    }
    CDR(list_val(last)) = B;
    BIF_RET(copy);
}

First, the length of A is computed. This operation is O(n) where n is the length of the list. Trust me on this one, of just go look at erts/emulator/beam/utils.c.

If either lists are empty, return the other one.

On lines 29 - 30, space is allocated on the process heap for a new list. Each list item has a size of 2 Eterm: one for the list itself, one for the element.

The code on lines 31 to 42 just copies A to our newly allocated list and adds B on its tail.

last = CONS(hp, CAR(list_val(list)), make_list(hp+2));

Ok, what is that?

From erts/emulator/beam/erl_term.h, we need some more definitions:

1
2
3
4
5
#define CONS(hp, car, cdr) \
        (CAR(hp)=(car), CDR(hp)=(cdr), make_list(hp))

#define CAR(x)  ((x)[0])
#define CDR(x)  ((x)[1])

make_list is used to return an Eterm tagged as a list from a pointer on the process heap.

list_val does the opposite and returns the address on the heap of the list.

CONS puts 2 elements in hp[0] (the element in the list) and hp[1] (the next list item) and returns the list at hp as an Eterm.

That complicated expression can now be read as:

  • copy in hp[0] the element of the list we want to copy,
  • set the next list item as the Eterm following on the stack (hp+2) since the new list is allocated as one block on the stack.

When list A has been fully copied, the last element is set to B at the line 42 with CDR(list_val(last)) = B;.

Memory allocation

If you're curious, erts/emulator/beam/sys.h can tell you that an Eterm has usually a size of 4 bytes.

Small integers can be stored directly in those 4 bytes. There are 26 bits usable to store a small integer so it can store up to 2²⁵ = 33554432 because of the sign bit. For reference, pile of poo which was added in october 2010 has a value of 128169.

We can consider that unicode characters can be stored directly in the list.

If a string has n characters, it will use 2 * n words.

A binary on the opposite will use 3 to 6 words (depending on the data size) plus the size of the data itself.

A string stored as a list takes about 8 times the space it would take stored as a binary.

What about other lists?

The pointed values are not copied, only the list items themsleves.

Appending a list of records of n tuples to a list of m tuples will take the same time/memory as concatenating a string of length n to one of length m.

The only difference is that tuples are not boxed in an Eterm. The list only contains an Eterm as a pointer to the tuple: where it's stored in the heap.

Solution?

Of course there is one! You should use binary and IoList whenever possible. An IoList is iolist = [char() | binary() | iolist()]. They are “deep-list”.

Want to append an IoList a to an IoList b? Simple! Just create a new list with 2 elements: a and b! Concatenation has become an O(1) operation. Most IO apis accept IoLists.

blog comments powered by Disqus

If you enjoyed this article, feel free to donate in Bitcoin at 1BWTT22RTERhPszYTybZnrVgScKQpD41i4