Why you shouldn't concatenate strings in erlang
In the previous
article, I said that concatenating strings was an
O(n) operation on the length of the first
string.
I feel the need to explain that.
In erlang, strings are lists of integer. Since they are integers, they can store any unicode characters such as the snowman ☃ which has a value of 9731.
When one wants to concatenate 2 strings, one can use
string:concat/2 implemented in
lib/stdlib/src/string.erl as follows:
(All code extracts in this article are subject to the Erlang Public License.)
1 2 3 4 5 6 7 8 9 |
%% concat(String1, String2) %% Concatenate 2 strings. -spec concat(String1, String2) -> String3 when String1 :: string(), String2 :: string(), String3 :: string(). concat(S1, S2) -> S1 ++ S2. |
Ok, strings are lists, just concatenate lists!
Down the rabbit hole
The ++ operator is a BIF, a Built-In
Function implemented in C in the Beam VM. It is implemented
in
erts/emulator/beam/erl_bif_lists.c.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 |
/* * erlang:'++'/2 */ Eterm ebif_plusplus_2(BIF_ALIST_2) { return append(BIF_P, BIF_ARG_1, BIF_ARG_2); } static BIF_RETTYPE append(Process* p, Eterm A, Eterm B) { Eterm list; Eterm copy; Eterm last; size_t need; Eterm* hp; int i; if ((i = list_length(A)) < 0) { BIF_ERROR(p, BADARG); } if (i == 0) { BIF_RET(B); } else if (is_nil(B)) { BIF_RET(A); } need = 2*i; hp = HAlloc(p, need); list = A; copy = last = CONS(hp, CAR(list_val(list)), make_list(hp+2)); list = CDR(list_val(list)); hp += 2; i--; while(i--) { Eterm* listp = list_val(list); last = CONS(hp, CAR(listp), make_list(hp+2)); list = CDR(listp); hp += 2; } CDR(list_val(last)) = B; BIF_RET(copy); } |
First, the length of A is computed. This
operation is O(n) where n is the
length of the list. Trust me on this one, of just go look
at
erts/emulator/beam/utils.c.
If either lists are empty, return the other one.
On lines 29 - 30, space is allocated on the process heap
for a new list. Each list item has a size of 2
Eterm: one for the list itself, one for the
element.
The code on lines 31 to 42 just copies A to
our newly allocated list and adds B on its tail.
last = CONS(hp, CAR(list_val(list)),
make_list(hp+2));
Ok, what is that?
From
erts/emulator/beam/erl_term.h, we need
some more definitions:
1 2 3 4 5 |
#define CONS(hp, car, cdr) \ (CAR(hp)=(car), CDR(hp)=(cdr), make_list(hp)) #define CAR(x) ((x)[0]) #define CDR(x) ((x)[1]) |
make_list is used to return an
Eterm tagged as a list from a pointer on the
process heap.
list_val does the opposite and returns the
address on the heap of the list.
CONS puts 2 elements in hp[0]
(the element in the list) and hp[1] (the next
list item) and returns the list at hp as an
Eterm.
That complicated expression can now be read as:
- copy in
hp[0]the element of the list we want to copy, - set the next list item as the
Etermfollowing on the stack (hp+2) since the new list is allocated as one block on the stack.
When list A has been fully copied, the last
element is set to B at the line 42 with
CDR(list_val(last)) = B;.
Memory allocation
If you're curious,
erts/emulator/beam/sys.h can tell you that
an Eterm has usually a size of 4 bytes.
Small integers can be stored directly in those 4 bytes. There are 26 bits usable to store a small integer so it can store up to 2²⁵ = 33554432 because of the sign bit. For reference, pile of poo which was added in october 2010 has a value of 128169.
We can consider that unicode characters can be stored directly in the list.
If a string has n characters, it will use
2 * n words.
A binary on the opposite will use 3 to 6
words (depending on the data size) plus the size of the
data itself.
A string stored as a list takes about 8 times the space
it would take stored as a binary.
What about other lists?
The pointed values are not copied, only the list items themsleves.
Appending a list of records of n tuples to
a list of m tuples will take the same
time/memory as concatenating a string of length
n to one of length m.
The only difference is that tuples are not boxed in an
Eterm. The list only contains an
Eterm as a pointer to the tuple:
where it's stored in the heap.
Solution?
Of course there is one! You should use
binary and IoList whenever
possible. An IoList is iolist = [char()
| binary() | iolist()]. They are “deep-list”.
Want to append an IoList a to an
IoList b? Simple! Just create a new list with
2 elements: a and b!
Concatenation has become an O(1) operation. Most IO apis
accept IoLists.