dark theme

Why you shouldn't concatenate strings in erlang

In the previous article, I said that concatenating strings was an O(n) operation on the length of the first string.

I feel the need to explain that.

In erlang, strings are lists of integer. Since they are integers, they can store any unicode characters such as the snowman ☃ which has a value of 9731.

When one wants to concatenate 2 strings, one can use string:concat/2 implemented in lib/stdlib/src/string.erl as follows:

(All code extracts in this article are subject to the Erlang Public License.)

1
2
3
4
5
6
7
8
9
%% concat(String1, String2)
%%  Concatenate 2 strings.

-spec concat(String1, String2) -> String3 when
      String1 :: string(),
      String2 :: string(),
      String3 :: string().

concat(S1, S2) -> S1 ++ S2.

Ok, strings are lists, just concatenate lists!

Down the rabbit hole

The ++ operator is a BIF, a Built-In Function implemented in C in the Beam VM. It is implemented in erts/emulator/beam/erl_bif_lists.c.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
/*
 * erlang:'++'/2
 */

Eterm
ebif_plusplus_2(BIF_ALIST_2)
{
    return append(BIF_P, BIF_ARG_1, BIF_ARG_2);
}

static BIF_RETTYPE append(Process* p, Eterm A, Eterm B)
{
    Eterm list;
    Eterm copy;
    Eterm last;
    size_t need;
    Eterm* hp;
    int i;

    if ((i = list_length(A)) < 0) {
        BIF_ERROR(p, BADARG);
    }
    if (i == 0) {
        BIF_RET(B);
    } else if (is_nil(B)) {
        BIF_RET(A);
    }

    need = 2*i;
    hp = HAlloc(p, need);
    list = A;
    copy = last = CONS(hp, CAR(list_val(list)), make_list(hp+2));
    list = CDR(list_val(list));
    hp += 2;
    i--;
    while(i--) {
        Eterm* listp = list_val(list);
        last = CONS(hp, CAR(listp), make_list(hp+2));
        list = CDR(listp);
        hp += 2;
    }
    CDR(list_val(last)) = B;
    BIF_RET(copy);
}

First, the length of A is computed. This operation is O(n) where n is the length of the list. Trust me on this one, of just go look at erts/emulator/beam/utils.c.

If either lists are empty, return the other one.

On lines 29 - 30, space is allocated on the process heap for a new list. Each list item has a size of 2 Eterm: one for the list itself, one for the element.

The code on lines 31 to 42 just copies A to our newly allocated list and adds B on its tail.

last = CONS(hp, CAR(list_val(list)), make_list(hp+2));

Ok, what is that?

From erts/emulator/beam/erl_term.h, we need some more definitions:

1
2
3
4
5
#define CONS(hp, car, cdr) \
        (CAR(hp)=(car), CDR(hp)=(cdr), make_list(hp))

#define CAR(x)  ((x)[0])
#define CDR(x)  ((x)[1])

make_list is used to return an Eterm tagged as a list from a pointer on the process heap.

list_val does the opposite and returns the address on the heap of the list.

CONS puts 2 elements in hp[0] (the element in the list) and hp[1] (the next list item) and returns the list at hp as an Eterm.

That complicated expression can now be read as:

  • copy in hp[0] the element of the list we want to copy,
  • set the next list item as the Eterm following on the stack (hp+2) since the new list is allocated as one block on the stack.

When list A has been fully copied, the last element is set to B at the line 42 with CDR(list_val(last)) = B;.

Memory allocation

If you're curious, erts/emulator/beam/sys.h can tell you that an Eterm has usually a size of 4 bytes.

Small integers can be stored directly in those 4 bytes. There are 26 bits usable to store a small integer so it can store up to 2²⁵ = 33554432 because of the sign bit. For reference, pile of poo which was added in october 2010 has a value of 128169.

We can consider that unicode characters can be stored directly in the list.

If a string has n characters, it will use 2 * n words.

A binary on the opposite will use 3 to 6 words (depending on the data size) plus the size of the data itself.

A string stored as a list takes about 8 times the space it would take stored as a binary.

What about other lists?

The pointed values are not copied, only the list items themsleves.

Appending a list of records of n tuples to a list of m tuples will take the same time/memory as concatenating a string of length n to one of length m.

The only difference is that tuples are not boxed in an Eterm. The list only contains an Eterm as a pointer to the tuple: where it's stored in the heap.

Solution?

Of course there is one! You should use binary and IoList whenever possible. An IoList is iolist = [char() | binary() | iolist()]. They are “deep-list”.

Want to append an IoList a to an IoList b? Simple! Just create a new list with 2 elements: a and b! Concatenation has become an O(1) operation. Most IO apis accept IoLists.

blog comments powered by Disqus

If you enjoyed this article, feel free to Flattr thisflatter it