Memory - Part 4: Intersec’s custom allocators
malloc(): not the one-size-fits-all allocator
malloc() is extremely convenient because it is generic.
It does not make any assumptions about the context of the allocation and the deallocation. An allocation and its matching deallocation may immediately follow each other, or be separated by a whole job execution. They may take place in the same thread, or not… Since it is generic, it cannot optimize for any particular pattern, meaning that long-term allocations share the same pool as short-term ones.
Consequently, the implementation of malloc() is complex.
Since memory can be shared by several threads, the pool must be shared and locking is required. Since modern hardware has more and more physical threads, locking the pool at every single allocation would have a disastrous impact on performance. Therefore, modern malloc() implementations have thread-local caches and will lock the main pool only if the caches get too small or too large. A side effect is that some memory gets stuck in thread-local caches and is not easily accessible from other threads.
Since chunks of memory can get stuck at different locations (within thread-local caches, in the global pool, or simply still allocated by the process), the heap gets fragmented. It becomes hard to release unused memory to the kernel, and it becomes highly probable that two successive allocations will return chunks of memory that are far from each other, generating random accesses to the heap. As we have seen in the previous article, random access is far from being the optimal way to access memory.
As a consequence, it is sometimes necessary to have specialized allocators with predictable behavior. At Intersec, we have several of them to use in various situations. In some specific use cases, they improve performance by several orders of magnitude.
Benchmarks
In order to provide some comparison points, we ran a small synthetic benchmark. This benchmark tests the performance of malloc() and free() in two scenarios. The first scenario is simple: we allocate 100 million pointers and then we free them all. This tests the allocator's raw performance on small allocations in a single-threaded environment.
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>

struct list {
    struct list *next;
};

static int64_t timeval_diffmsec(const struct timeval *tv2,
                                const struct timeval *tv1)
{
    int64_t delta = tv2->tv_sec - tv1->tv_sec;

    return delta * 1000 + (tv2->tv_usec - tv1->tv_usec) / 1000;
}

int main(int argc, char *argv[])
{
    for (int k = 0; k < 3; k++) {
        struct timeval start;
        struct timeval end;
        struct list *head = NULL;
        struct list *tail = NULL;

        /* Allocation */
        gettimeofday(&start, NULL);

        head = tail = malloc(sizeof(struct list));
        for (int i = 0; i < 100000000; i++) {
            tail->next = malloc(sizeof(struct list));
            tail = tail->next;
        }
        tail->next = NULL;

        gettimeofday(&end, NULL);
        printf("100,000,000 allocations in %ldms (%ld/s)\n",
               timeval_diffmsec(&end, &start),
               100000000UL * 1000 / timeval_diffmsec(&end, &start));

        /* Deallocation */
        gettimeofday(&start, NULL);

        while (head) {
            struct list *cur = head;

            head = head->next;
            free(cur);
        }

        gettimeofday(&end, NULL);
        printf("100,000,000 deallocations in %ldms (%ld/s)\n",
               timeval_diffmsec(&end, &start),
               100000000UL * 1000 / timeval_diffmsec(&end, &start));
    }
    return 0;
}
The second scenario adds multithreading: after allocating all our pointers, we start freeing them in another thread while allocating new batches of pointers in the main thread. As a consequence, allocation and deallocation are run concurrently in two different threads, creating contention on the allocation pool.
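The code for the second scenario is not reproduced here; the following is a minimal sketch of how the contended case can be set up (timing code omitted for brevity, batch construction reusing the list from the first scenario):

#include <pthread.h>
#include <stdlib.h>

struct list {
    struct list *next;
};

#define BATCH_SIZE 100000000

static struct list *build_batch(void)
{
    struct list *head = malloc(sizeof(struct list));
    struct list *tail = head;

    for (int i = 0; i < BATCH_SIZE; i++) {
        tail->next = malloc(sizeof(struct list));
        tail = tail->next;
    }
    tail->next = NULL;
    return head;
}

static void *free_batch(void *arg)
{
    struct list *head = arg;

    while (head) {
        struct list *cur = head;

        head = head->next;
        free(cur);
    }
    return NULL;
}

int main(void)
{
    struct list *batch = build_batch();

    for (int k = 0; k < 2; k++) {
        pthread_t freer;

        /* Free the previous batch in a worker thread while the main
         * thread allocates the next one: both threads hit the
         * allocation pool at the same time. */
        pthread_create(&freer, NULL, &free_batch, batch);
        batch = build_batch();
        pthread_join(&freer, NULL);
    }
    free_batch(batch);
    return 0;
}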
The benchmark was run three times: once with ptmalloc (the glibc implementation), once with tcmalloc (Google's implementation), and once with jemalloc (the Linux port of the FreeBSD implementation).
| Scenario | Allocator | Allocation | Deallocation | Memory | Time |
|---|---|---|---|---|---|
| Uncontended | ptmalloc | 1512ms (66M/s) | 1261ms (79M/s) | 2.9GiB | 9.98s |
| Uncontended | tcmalloc | 1712ms (58M/s) | 2229ms (44M/s) | 769MiB | 12.10s |
| Uncontended | jemalloc | 3312ms (30M/s) | 4191ms (23M/s) | 784MiB | 22.55s |
| Contended | ptmalloc | 16154ms (6.2M/s) | 15309ms (6.3M/s) | 2.9GiB | 39.18s |
| Contended | tcmalloc | 2860ms (34M/s) | 6707ms (14M/s) | 1.7GiB | 14.62s |
| Contended | jemalloc | 3845ms (26M/s) | 11672ms (8.5M/s) | 2.3GiB | 23.55s |
Indeed, the results depend vastly on the malloc() implementation. While ptmalloc shows slightly better performance than tcmalloc in an uncontended environment (at the cost of a much larger memory footprint), tcmalloc behaves much better in a multithreaded environment.
One allocation batch contains 100M 8-byte pointers, which means it allocates 800MB (762MiB). As a consequence, in the single-threaded case, the payload is 762MiB. As we can see, tcmalloc is near-optimal in terms of memory consumption. However, one odd thing about tcmalloc is that deallocation is slower than allocation: the deallocation thread cannot free memory as fast as it is allocated, causing an ever-growing memory footprint if we increase the number of threads in the benchmark.
The benchmark is synthetic and stresses only the small-chunk allocator in an extremely specific use case. Thus, it should not be taken as absolute proof that tcmalloc is faster in multithreaded environments while ptmalloc is faster in single-threaded ones. However, the benchmark does show that there is no perfect malloc() implementation and that choosing the right implementation for your use case may have a huge impact on overall performance.
Last but not least, this benchmark shows that you can only perform a few million allocations/deallocations per second. This may seem like a lot, but as soon as you want to process several hundred thousand events per second, and each event triggers one or more allocations, malloc() starts being a bottleneck.
Stack allocator
The first (and certainly the most used) custom allocator at Intersec is
the stack allocator. This is a LIFO
allocator, meaning that allocations are deallocated in the reverse
order of their allocation. It mimics the behavior of the program’s stack
since allocations are grouped in frames, and frames are deallocated at
once.
Internals
The stack allocator is an arena-based allocator. It allocates huge blocks and then splits them into smaller chunks.
It keeps track of two key pieces of information:
- the bottom of the stack
- the delimitation of the frames
When an allocation is performed, the bottom of the stack is incremented by the size of the requested memory (plus the alignment requirements and the size of the canaries). If the requested size cannot fit in the current block, a new one is allocated: the allocator will not try to fill the gap left in the previous block.
When a frame is created, a mark is pushed at the bottom of the stack with the position of the beginning of the previous frame. The allocator always knows the position of the beginning of the current frame. That way, removing a frame is extremely fast: the allocator sets the bottom of the stack back to the current frame's position, then it reloads the previous frame's position and makes that the current frame. Additionally, the allocator lists the blocks that were totally freed by the operation and deallocates them.
The deallocation of a frame is done in amortized constant time: it does not depend on the number of allocations made in the frame, but on the number of blocks the frame spans. Block size is chosen large enough to contain several typical frames, which means that deallocating a frame will, most of the time, not deallocate any block.
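To make the mechanism concrete, here is a minimal sketch of a frame-based stack allocator. All the names are hypothetical, and it cuts every corner the real implementation cares about: a single fixed-size block, no alignment handling, no canaries.

#include <stdlib.h>

typedef struct stack_pool_t {
    char *bottom;      /* next free byte in the block       */
    char *end;         /* end of the block                  */
    void *last_frame;  /* mark of the currently open frame  */
    char  block[1 << 20];
} stack_pool_t;

static void stack_init(stack_pool_t *sp)
{
    sp->bottom     = sp->block;
    sp->end        = sp->block + sizeof(sp->block);
    sp->last_frame = NULL;
}

static void *stack_alloc(stack_pool_t *sp, size_t size)
{
    char *res = sp->bottom;

    if (size > (size_t)(sp->end - res)) {
        return NULL; /* the real allocator opens a new block here */
    }
    sp->bottom += size;
    return res;
}

/* Push a frame: write a mark holding the previous frame's position
 * at the bottom of the stack and make it the current frame. */
static void *stack_push(stack_pool_t *sp)
{
    void **mark = stack_alloc(sp, sizeof(void *));

    *mark = sp->last_frame;
    sp->last_frame = mark;
    return mark;
}

/* Pop a frame: rewind the bottom to the mark and restore the
 * previous frame; everything allocated since the push is gone. */
static void stack_pop(stack_pool_t *sp)
{
    void **mark = sp->last_frame;

    sp->bottom     = (char *)mark;
    sp->last_frame = *mark;
}

int main(void)
{
    static stack_pool_t sp;

    stack_init(&sp);
    stack_push(&sp);
    (void)stack_alloc(&sp, 100); /* released by the pop below */
    stack_pop(&sp);
    return 0;
}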
Since allocations and deallocations are done in a strict order, two consecutive allocations return adjacent memory chunks (unless the new allocation requires a new block). This helps improve the locality of memory accesses within the program. Moreover, thanks to the allocation ordering, there is very little fragmentation in a stack allocator. As a consequence, the memory pressure of the stack allocator decreases when the actual amount of allocated memory does.
t_stack
We do have a special stack allocator: the t_stack. This is a thread-local singleton instance of the stack allocator, used as a complement to the normal program stack. The main advantage of the t_stack is that it allows efficient dynamic allocation of temporary memory: whenever we want to allocate some memory in a function and release it at the end of that function, we just use t_stack-based allocation.
Frame creation and deletion on the t_stack are not bound to a function scope: a frame is opened using the special macro t_scope at the beginning of a lexical scope. That macro uses GNU C's cleanup attribute to emulate C++ RAII behavior in C: it creates the frame and registers a cleanup handler that destroys the frame whenever the lexical scope it was defined in is exited 1.
static inline void t_scope_cleanup(const void **frame_ptr)
{
    if (unlikely(*frame_ptr != mem_stack_pop(&t_pool_g))) {
        e_panic("unbalanced t_stack");
    }
}

#define t_scope__(n)                                                     \
    const void *t_scope_##n                                              \
    __attribute__((unused, cleanup(t_scope_cleanup)))                    \
        = mem_stack_push(&t_pool_g)

#define t_scope_(n)  t_scope__(n)
#define t_scope      t_scope_(__LINE__)
Since frame allocation/deallocation is controlled by the developer, the t_stack is somewhat more flexible than the normal stack. Behaviors that are dangerous or impossible with the stack, such as performing allocations within a loop or returning t_stack-allocated memory, are safe with the t_stack. Moreover, since there is no size limitation (other than the available amount of RAM), the t_stack can be used for general-purpose allocations as long as the memory lifetime is compatible with the frame allocation scheme.
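For instance, allocating inside a loop under a single t_scope is safe: every buffer stays valid until the frame is destroyed when the scope exits. A small sketch, assuming the t_scope/t_new macros shown above (handle_line() is a hypothetical consumer):

/* Doing this with alloca() would grow the program's stack at every
 * iteration; on the t_stack, all the buffers simply live until the
 * frame is destroyed on scope exit. */
void process_lines(int nb, const int *lens)
{
    t_scope;

    for (int i = 0; i < nb; i++) {
        char *buf = t_new(char, lens[i]);

        handle_line(buf, lens[i]);
    }
}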
The ability to allocate t_stack memory in a function that declares no t_scope is a clear departure from the behavior of the normal stack. With the standard program stack, a function cannot have side effects on the stack: when the function exits, it restores the stack to the state it had when the function was entered. In order to limit confusion, we use a coding convention: when a function can have a side effect on the t_stack (that is, when it can allocate memory in a frame created by one of its callers), its name must be prefixed with t_. That way, it is easy to detect missing t_scopes: if a function calls a t_ function but contains no t_scope, then either it should itself be named t_, or the t_scope has been inadvertently omitted.
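As an illustration of the convention, here is a hypothetical t_strdup() helper: it allocates its result in the frame of its caller, hence the t_ prefix, and the caller provides the t_scope:

#include <stdio.h>
#include <string.h>

/* Hypothetical helper following the convention: it allocates in the
 * caller's frame, so its name carries the t_ prefix, and it declares
 * no t_scope of its own. */
static char *t_strdup(const char *s)
{
    size_t len = strlen(s) + 1;
    char  *res = t_new(char, len);

    memcpy(res, s, len);
    return res;
}

void greet(const char *name)
{
    /* The caller owns the frame the t_ function allocates into. */
    t_scope;

    puts(t_strdup(name));
}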
An additional advantage of the t_stack is that it often (though not always) makes error management easier compared to heap allocations. Since deallocation happens automatically at the end of the t_scope, there is no need for special code paths to free memory in case of error.
/* Error handling with heap-allocated memory. */
int process_data_heap(int len)
{
    /* We need a variable to remember the result. */
    int ret;
    /* We will need to deallocate memory, so we have to keep the
     * pointer in a variable.
     */
    byte *data = (byte *)malloc(len * sizeof(byte));

    ret = do_something(data, len);
    free(data);
    return ret;
}

/* Error handling with t_stack-allocated memory. */
int process_data_t_stack(int len)
{
    /* Associate the function scope to a t_stack frame.
     * That way all t_stack-allocated memory within the
     * function will be released at exit.
     */
    t_scope;

    return do_something(t_new(byte, len), len);
}
A side effect of using the t_stack is that many short-term allocations that would have been made on the heap are now made on the t_stack, which reduces heap fragmentation. And since the t_stack is thread-local, it is not subject to contention.
The t_stack relies on non-standard extensions of the C language and is a bit of magic for newcomers at Intersec, but it certainly is one of the best additions to the standard library that we have at Intersec.
Benchmark
We ran our benchmark on the stack allocator:
| Scenario | Allocation | Deallocation | Memory | Time |
|---|---|---|---|---|
| Uncontended | 833ms (120M/s) | 0ms | 1.5GiB | 2.99s |
As you can see, the allocator is fast: it outperforms both ptmalloc and tcmalloc in their optimal cases. Thanks to the frame mechanism, deallocation time does not depend at all on the number of allocations (the benchmark could be improved to measure frame creation/destruction performance).
The current implementation of the stack allocator has a minimum allocation alignment of __BIGGEST_ALIGNMENT__, a platform-dependent constant that represents the maximum alignment requirement imposed by the CPU. On x86_64, that constant is 16 bytes because some instructions (such as SSE instructions) operate on 16-byte vectors with a 16-byte alignment requirement. This explains why the memory footprint is twice the optimal one.
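Concretely, the allocator rounds every request up to that minimum alignment, something like:

/* Each request is rounded up to a multiple of __BIGGEST_ALIGNMENT__
 * (16 on x86_64), so an 8-byte allocation consumes 16 bytes:
 * aligned_size(8) == 16, twice the payload, which matches the 1.5GiB
 * measured against the 762MiB of useful data. */
static size_t aligned_size(size_t size)
{
    return (size + __BIGGEST_ALIGNMENT__ - 1)
         & ~(size_t)(__BIGGEST_ALIGNMENT__ - 1);
}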
FIFO allocator
The FIFO problem
Another frequent memory usage pattern is the (mostly) FIFO (first-in, first-out) pipe: memory is deallocated (approximately) in the order in which it was allocated. A typical use case is buffering request contexts in a network protocol implementation: each request is associated with a context that is allocated when the request is emitted and freed when the reply is received. Most of the time, requests will be treated in the order in which they are emitted (this is not always true, but even a long processing time remains short compared to the lifetime of this kind of process).
When FIFO data is allocated directly on the heap, it can amplify the fragmentation issue since the next deallocated chunk will probably not be at the end of the heap and will thus create a hole in it (and since the FIFO is not alone on the heap, other allocations will get inserted between two FIFO elements, making things even worse).
Solution
For those reasons, we decided to isolate such usage patterns from the rest of the allocations. Instead of using heap allocations, we use a custom allocator.
This allocator works basically the same way as the stack allocator: it consists of an arena of huge blocks of memory used linearly (a new block is allocated only when the next allocation cannot fit in the current one). Instead of using a frame-based model, the FIFO allocator uses a per-block size-accounting mechanism: each block maintains the amount of allocated memory it contains, and when all the data contained in a block has been released, the block itself is released.
Blocks are allocated using mmap in order to ensure they will not interfere with the heap (and thus will not cause fragmentation there).
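Here is a minimal sketch of the idea, with hypothetical names. To keep it short, it counts live chunks instead of allocated bytes (the real allocator accounts sizes), and error handling is omitted:

#include <stddef.h>
#include <sys/mman.h>

#define BLOCK_SIZE  (1 << 20)

typedef struct fifo_block_t {
    size_t used;  /* live allocations in this block */
    size_t pos;   /* bump pointer within data[]     */
    char   data[];
} fifo_block_t;

typedef struct fifo_pool_t {
    fifo_block_t *current;
} fifo_pool_t;

static fifo_block_t *fifo_block_new(void)
{
    fifo_block_t *blk = mmap(NULL, BLOCK_SIZE, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    blk->used = 0;
    blk->pos  = 0;
    return blk;
}

void *fifo_alloc(fifo_pool_t *fp, size_t size)
{
    fifo_block_t *blk;
    char *res;

    /* Keep chunks pointer-aligned and leave room for a header that
     * points back to the owning block. */
    size  = (size + sizeof(void *) - 1) & ~(sizeof(void *) - 1);
    size += sizeof(fifo_block_t *);

    blk = fp->current;
    if (!blk || blk->pos + size > BLOCK_SIZE - sizeof(fifo_block_t)) {
        blk = fp->current = fifo_block_new();
    }
    res = blk->data + blk->pos;
    blk->pos += size;
    blk->used++;

    *(fifo_block_t **)res = blk;
    return res + sizeof(fifo_block_t *);
}

void fifo_free(fifo_pool_t *fp, void *mem)
{
    fifo_block_t *blk = *((fifo_block_t **)mem - 1);

    /* Once the last chunk of a non-current block is released, the
     * whole block is returned to the kernel. */
    if (--blk->used == 0 && blk != fp->current) {
        munmap(blk, BLOCK_SIZE);
    }
}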
Since the FIFO allocator shares the same allocation pattern (but not the same deallocation pattern) as the stack allocator, it shares some of its properties. One of them is that consecutive allocations are adjacent. However, the induced locality improvement matters less given the usage pattern of a FIFO allocator: most of the time, it is used to allocate independent elements that will rarely be accessed together.
The FIFO allocator is designed to be used in a single-threaded environment and thus is not subject to contention issues.
Benchmark
We ran our uncontended benchmark with the FIFO allocator in order to compare its performance with malloc()'s:
| Scenario | Allocation | Deallocation | Memory | Time |
|---|---|---|---|---|
| Uncontended | 1100ms (90M/s) | 638ms (156M/s) | 1.5GiB | 5.30s |
As with the stack allocator, the FIFO allocator outperforms the malloc() implementations. However, it is a bit slower than the stack allocator because it has to track each allocation independently, and it may not have been optimized as heavily.
Ring allocator
The ring allocator is somehow a mix between the stack and FIFO allocators. It uses frames to group allocations and allows constant-time deallocation of a large number of allocations while having a mostly-FIFO usage pattern. Unlike those of the stack allocator, the frames of the ring allocator are not nested within each other: each frame is independent and self-contained.
In order to allocate memory in the ring allocator, the first step is to create a new frame; this requires that the previous frame has been sealed. Once a frame is open in the allocator, allocations can be made and will automatically be part of the active frame. When all the allocations have been done, the frame must be sealed. A sealed frame is still alive: the allocations it contains remain accessible, but the frame cannot accept new allocations. When the frame is not needed anymore, it must be released.
Releasing memory in the ring allocator is thread-safe. This makes the allocator very useful for building messages that are handed over to a worker thread for processing. In that context, it works as a near drop-in replacement for the t_stack in code that needs multi-threaded treatment.
/* Single-threaded version, using the t_stack. */
void do_job(void)
{
    t_scope;
    job_ctx_t *ctx = t_new(job_ctx_t, 1);

    run_job(ctx);
}

/* Multi-threaded version, using ring allocation.
 * Note that it uses Apple's block extension.
 */
void do_job(void)
{
    const void *frame = r_newframe();
    job_ctx_t *ctx = r_new(job_ctx_t, 1);

    r_seal();
    thr_schedule(^{
        run_job(ctx);
        r_release(frame);
    });
}
In this use case, frames get released in mostly-FIFO order.
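The article does not detail the internals, but a plausible way to make out-of-order releases safe while still reclaiming memory in FIFO order is to only mark a frame at release time and let the allocating thread advance the ring's tail over marked frames. A conceptual sketch (all names hypothetical):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct ring_frame_t {
    struct ring_frame_t *next;  /* next (younger) frame in the ring */
    atomic_bool released;
} ring_frame_t;

typedef struct ring_pool_t {
    ring_frame_t *tail;         /* oldest frame still alive */
} ring_pool_t;

/* May be called from any thread: only marks the frame. */
void ring_release(ring_frame_t *frame)
{
    atomic_store(&frame->released, true);
}

/* Called from the allocating thread: reclaims the space of every
 * leading released frame, so memory is recycled in FIFO order even
 * if releases arrive slightly out of order. */
void ring_collect(ring_pool_t *rp)
{
    while (rp->tail && atomic_load(&rp->tail->released)) {
        ring_frame_t *done = rp->tail;

        rp->tail = done->next;
        /* ... the region between done and done->next is reusable ... */
    }
}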
Once again, we ran our benchmark using that allocator. Since the ring allocator is thread-safe, the benchmark covers both the contended and the non-contended cases:
| Scenario | Allocation | Deallocation | Memory | Time |
|---|---|---|---|---|
| Uncontended | 861ms (116M/s) | 0ms | 764MiB | 2.82s |
| Contended | 862ms (116M/s) | 0ms | 764MiB | 2.83s |
Other custom allocators
This article covers three of the custom allocators we have at Intersec. These allocators are not general-purpose: they are optimized for specific usage patterns. Thankfully, most of the time a combination of these three allocators is enough to avoid most of the issues we can encounter with malloc(), such as lack of locality, locking contention during allocation, and heap fragmentation.
However, sometimes this is not the case and we have no choice but to build a custom implementation of a general-purpose allocator to meet our performance requirements. For this reason, we also have a TLSF (Two-Level Segregated Fit) based allocator. TLSF is an allocation algorithm that was designed for real-time processing: it is guaranteed to operate in constant time (for both allocation and deallocation).
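The heart of TLSF is a two-level array of segregated free lists indexed by size, where both index computations use only bit operations and thus run in constant time. A sketch of the mapping (SL_BITS and the function name are illustrative):

/* Sketch of TLSF's two-level mapping: the first-level index is the
 * position of the size's most significant bit (a power-of-two range);
 * the second level linearly subdivides that range into 1 << SL_BITS
 * sub-ranges. Free chunks are kept in lists[fl][sl]. Small sizes,
 * which would make fl < SL_BITS, are special-cased in real
 * implementations, as is size == 0. */
#define SL_BITS  5  /* 32 sub-ranges per power-of-two range */

static void tlsf_mapping(unsigned long size, int *fl, int *sl)
{
    /* index of the most significant bit (GCC/Clang builtin) */
    *fl = (int)(sizeof(unsigned long) * 8 - 1) - __builtin_clzl(size);
    /* the SL_BITS bits just below the MSB */
    *sl = (int)((size >> (*fl - SL_BITS)) & ((1UL << SL_BITS) - 1));
}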
More anecdotally, we also have some page allocators and a persistent allocator. That last one will probably be covered in a future article.
Next: debugging tools
In the final article of this series, we will cover tools to deal with memory issues.
1. Before introducing the t_scope macro, developers had to explicitly t_push() and t_pop() frames, but this was too dangerous and led to several bugs due to missing t_pop() calls (mostly in error handling paths), themselves leading to memory leaks. The introduction of the t_scope, even if it is not pure C, allowed much safer code since developers no longer have to care about the error cases, and the t_scope also automatically checks the push/pop balancing.