Memory – Part 4: Intersec’s custom allocators

malloc() is not the one-size-fits-all allocator

malloc() is extremely convenient because it is generic. It does not make any assumptions about the context of the allocation and the deallocation. Such allocators may just follow each other, or be separated by a whole job execution. They may take place in the same thread, or not… Since it is generic, each allocation is different from each other, meaning that long term allocations share the same pool as short term ones.

Consequently, the implementation of malloc() is complex. Since memory can be shared by several threads, the pool must be shared and locking is required. Since modern hardware has more and more physical threads, locking the pool at every single allocation would have disastrous impacts on performance. Therefore, modern malloc() implementations have thread-local caches and will lock the main pool only if the caches get too small or too large. A side effect is that some memory gets stuck in thread-local caches and is not easily accessible from other threads.

Since chunks of memory can get stuck at different locations (within thread-local caches, in the global pool, or just simply allocated by the process), the heap gets fragmented. It becomes hard to release unused memory to the kernel, and it becomes highly probable that two successive allocations will return chunk of memories that are far from each other, generating random accesses to the heap. As we have seen in the previous article, random access is far from being the optimal solution for accessing memory.

As a consequence, it is sometimes necessary to have specialized allocators with predictable behavior. At Intersec, we have several of them to use in various situations. In some specific use cases we increase performance by several orders of magnitude.

Benchmarks

In order to provide some comparison points, we ran a small synthetic benchmark. This benchmark tests the performance of malloc() and free() in two scenarios. The first scenario is simple: we allocate a 100 million pointers and then we free them all. The allocator’s raw performance is tested in a single-threaded environment for small allocations.

The second scenario adds multithreading: after allocating all our pointers, we start freeing them in another thread while allocating new batches of pointers in the main thread. As a consequence, allocation and deallocation are run concurrently in two different threads, creating contention on the allocation pool.

malloc-benchmark

The benchmark was run three times: once with the ptmalloc (glibc‘s implementation), another with tcmalloc (Google’s implementation), and finally with jemalloc (Linux port of FreeBSD implementation).

Scenario Allocation Deallocation Memory Time
Uncontended ptmalloc 1512ms (66M/s) 1261ms (79M/s) 2.9GiB 9.98s
tcmalloc 1712ms (58M/s) 2229ms (44M/s) 769MiB 12.10s
jemalloc 3312ms (30M/s) 4191ms (23M/s) 784MiB 22.55s
Contended ptmalloc 16154ms (6.2M/s) 15309ms (6.3M/s) 2.9GiB 39.18s
tcmalloc 2860ms (34M/s) 6707ms (14M/s) 1.7GiB 14.62s
jemalloc 3845ms (26M/s) 11672ms (8.5M/s) 2.3GiB 23.55s

Indeed, the results depend vastly on the implementation of malloc(). While, on a non-contended environment, the ptmalloc shows slightly better performances than tcmalloc (at the cost of a much larger memory footprint), tcmalloc behaves much better in a multithreaded environment.

One allocation batch contains 100M 8-byte pointers, this means it allocates 800MB (762MiB). As a consequence, in the single threaded case, the payload is 762MiB. As we can see tcmalloc is near-optimal in terms of memory consumption. However, one odd thing about tcmalloc is that deallocation is slower than allocation: the deallocation thread is not able to free memory as fast as it is allocated, causing an ever growing memory footprint if we increase the number of threads run in the benchmark.

The benchmark is synthetic and stresses only the small-chunk allocator in an extremely specific use-case. Thus, it should not be considered as an absolute proof that tcmalloc is faster in multithreaded environment while ptmalloc is faster in single threaded ones. However, the benchmark shows that there is no perfect malloc() implementation and that choosing the right implementation for your use case may have huge impacts on overall performance.

Last, but not least, this benchmark shows that you can perform only a few millions of allocations/deallocations per second. This may seem quite large, but as soon as you want to process several hundreds of thousands of events per second, and if each event triggers one or more allocations, then malloc() will start being a bottleneck.

Stack allocator

The first (and certainly the most used) custom allocator at Intersec is the stack allocator. This is a LIFO allocator, meaning that allocations are deallocated in the reverse order of their allocation. It mimics the behavior of the program’s stack since allocations are grouped in frames, and frames are deallocated at once.

Internals

The stack allocator is an arena-based allocator. It allocates huge blocks and then splits them into smaller chunks.

It keeps track of two key pieces of information about each block:

  • the bottom of the stack
  • the delimitation of the frames

When an allocation is performed, the bottom of the stack is incremented by the size of the requested memory (plus the alignment requirements and the size of the canaries). If the requested size cannot fit in the current block, a new one is allocated: the allocator will not try to fill the gap left in the previous block.

When a frame is created, a mark is pushed at the bottom of the stack with the position of the beginning of the previous frame. The allocator always knows the position of the beginning of the current frame. That way, removing a frame is extremely fast: the allocator sets the bottom of the stack back to the current frame’s position, then it reloads the previous frame’s position and makes it its current. Additionally, the allocator will list the blocks that were totally freed by the operation and deallocate them.

mem-stack

The deallocation of a frame is done in amortized constant time, it does not depend on the number of allocations made in the frame, but on the number of blocks contained in the frame. Block size is chosen large enough to contain several typical frames, which means that deallocating a frame will most of the time not deallocate any block.

Since allocations and deallocations are done in a strict order, two consecutive allocations will return adjacent memory chunks (unless the new allocation requires a new arena). This helps improving the locality of memory accesses within the program. Moreover, thanks to the allocation ordering, there is very few fragmentation in a stack allocator. As a consequence, the memory pressure of the stack allocator goes down when the actual amount of allocated memory does so.

t_stack

We do have a special stack allocator: the t_stack. This is a thread-local singleton instance of the stack allocator. It is used as a complement of the normal program’s stack. The main advantage of the t_stack is that it allows efficient dynamic allocation of temporary memory. Whenever we want to allocate some memory in a function and release it at the end of the function, we just use t_stack-based allocation.

The frame creation and deletion on the t_stack are not bound to a function scope, it is defined using a special macro t_scope at the beginning of a lexical scope. That macro makes use of the GNU’s cleanup attribute to emulate a C++’s RAII behavior in C: it creates the frame and adds a cleanup handler that will destroy the frame whenever the lexical scope of its definition is exited1.

Since the frame allocation/deallocation is controlled by the developer, the t_stack is somehow more flexible than the normal stack. Behaviors that are dangerous or impossible using the stack, such as doing some allocations within a loop or returning t_stack-allocated memory, are safe with the t_stack. Moreover, since there is no size limitation (other than the available amount of RAM), the t_stack can be used for general purpose allocations as long as the memory lifetime is compatible with the frame allocation scheme.

The ability to allocate t_stack memory in a function that declares no t_scope is a clear departure from the behavior of the normal stack. With the standard program stack, a function cannot have side effects on the stack: when the function exits, it restores the stack to the state it had when the function started. In order to limit confusion, we do use a coding convention: when a function can have a side effect on the t_stack (that is, when it can allocate memory in a frame created by one of its callers) its name must be prefixed by t_. That way, it’s easy to detect missing t_scopes: if a function uses a t_ function but contains no t_scope, then either it should be named t_ or the t_scope has been inadvertently omitted.

An additional advantage of the t_stack is that it often (not always) makes error management easier compared to heap-allocations. Since the deallocation is automatic at the end of the t_scope, there is no need to have special code paths to handle them in case of error.

A side-effect of the use of the t_stack is that many short-term allocations that would have been done on the heap are now done on the t_stack. This reduces the fragmentation of the heap. And since the t_stack is thread-local, it is not subject to contentions.

The t_stack relies on non-standard extensions of the C language and is a bit of magic for newcomers at Intersec, but it certainly is one of the best addition to the language standard library we do have at Intersec.

Benchmark

We ran our benchmark on the stack allocator:

Scenario Allocation Deallocation Memory time
Uncontended 833ms (120M/s) 0ms 1.5GiB 2.99s

As you can see, the allocator is  fast: it outperforms both ptmalloc and tcmalloc in their optimal cases. Thanks to the frame mechanism, the deallocation does not depend at all on the number of allocations (the benchmark could be improved to measure the frame creation/destruction performance).

The current implementation of the stack allocator has a minimum allocation alignment of __BIGGEST_ALIGNMENT__, which is a platform-dependant constant that represents the maximum alignment requirement imposed by the CPU. On x86_64, that constant is set to 16 bytes because some instructions (such as SSE instructions) operate on 16 bytes vectors with a 16 bytes alignment requirement. This explains why the memory footprint is twice the optimal one.

FIFO allocator

The FIFO problem

Another frequent memory usage pattern is the use of (mostly) FIFO (first-in, first-out) pipes: meaning that memory is deallocated (approximatively) in the order which it was allocated. A typical use-case is  buffering request contexts in a network protocol implementation: each request is associated to a context that is allocated when the request is emitted and get freed when the reply is received. Most of the time, requests will be treated in the order they are emitted (this is not always true, but even long processing time is short compared to the execution time of this kind of process).

When FIFO data is allocated directly on the heap, it can amplify the fragmentation issue since the next deallocated chunk will probably not be at the end of the heap and thus will create a hole in it (and since the FIFO is not alone on the heap, other allocations will get inserted between two FIFO elements, making things even worse).

Solution

For those reasons, we decided to isolate such usage patterns from the remaining of the allocations. Instead of using heap allocations, we use a custom allocator.

This allocator works basically the same way as the the stack allocator: it consists of an arena of huge blocks of memory used linearly (a new block is allocated only when the next allocation cannot fit in the current block). Instead of using a frame-based model, the FIFO allocator uses a per-block size-accounting mechanism. Each block maintains the amount of allocated memory it contains. When all the data contained in a block has been released, the block itself is released. Blocks are allocated using mmap in order to ensure they will not interfere with the heap (and thus will not cause fragmentation).

Since the FIFO allocator shares the same allocation pattern (but not the same deallocation pattern) as the stack allocator, it shares some properties. One of the them is the fact that consecutive allocations are adjacent. However, the induced locality improvement is less important due to the usage pattern of FIFO allocators: most of the time, this is used to allocate independent elements that will rarely be used together.

The FIFO allocator is designed to be used in a single-threaded environment and thus is not subject to contention issues.

Benchmark

We ran our uncontended benchmark with the FIFO allocator in order to compare its performances with malloc()‘s performances:

Scenario Allocation Deallocation Memory time
Uncontended 1100ms (90M/s) 638ms (156M/s) 1.5GiB 5.30s

As for the stack allocator, the FIFO allocator outperforms malloc() implementations. However it is a bit slower than the stack allocator because it has to track each allocation independently and may not have been as optimized as the stack allocator.

Ring allocator

The ring allocator is somehow a mix between the stack and FIFO allocators. It uses frames to group allocations and allows constant time deallocation of a large number of allocations while having a mostly-FIFO usage pattern. The frames in the ring allocator are not imbricated in the previous ones, each frame is independent and self-contained.

In order to allocate memory in the ring allocator, the first step is to create a new frame. This requires that the previous frame has been sealed. Once a frame is opened in the allocator, allocations can be made and will automatically be part of the active frame. When all the allocations have been done, the frame must be sealed. A sealed frame is still alive, that means that the allocations it contains are still accessible, but the frame just cannot accept new allocations. When the frame is not needed anymore, it must be released.

Releasing memory in the ring allocator is thread safe. This makes that allocator very useful to build messages to be transferred for processing to a worker thread. In that context it works as a nearly immediate replacement of the t_stack for code that need multi-threaded treatment.

In this use case, frames get released in mostly-FIFO order.

Once again, we ran our benchmark using that allocator. Since the ring allocator is thread-safe, the benchmark covers both the contended and the non-contended cases:

Scenario Allocation Deallocation Memory time
Uncontended 861ms (116M/s) 0ms 764MiB 2.82s
Contended 862ms (116M/s) 0ms 764MiB 2.83s

Other custom allocators

This article covers three custom allocators we have at Intersec. These three allocators are not for a general purpose: they are optimized for specific usage patterns. Thankfully, most of the time a combination of these three allocators is enough to avoid most issues we can encounter with malloc() such as the lack of locality, locking contention during allocation and heap fragmentation.

However, sometimes this is not the case and we have no choice but to build a custom implementation of a general purpose allocator to fit with our performance requirements. For this reason, we also have a TLSF (Two Level Segregate Fit) based allocator. TLSF is an allocation algorithm that has been designed for real-time processing, it is guaranteed to operate in constant time (for both allocation and deallocation).

More anecdotal, we also have some page allocators and a persistent allocator. This last one will probably be covered in a future article.

Next: debugging tools

In the final article of this series, we will cover tools to deal with memory issues.


  1. before introducing the t_scope macro, developers had to explicitly t_push() and t_pop() frames, but this was too dangerous and led to several bugs due to missing t_pop() (mostly in error handling cases), themselves leading to memory leaks. The introduction of the t_scope, even if it is not pure C, allowed much safer code since the developers do not have to care about the error cases, and the t_scope also automatically checks the push/pop balancing. 

  • http://coderi.ro Alexandru Cobuz

    Impresive benchmarks, but what kind of RAM are you using? Single channel ?

    • Florent Bruneau

      We’re not using anything special. The benchmarks ran on my 3 years old workstation with 8GB of 1333MHz DDR3.

  • http://blogs.perl.org/users/rurban/ Reini Urban

    Check out ptmalloc3, with _independent_comalloc() where you can mimic arenas for larger structs/arrays.

  • Richard Title

    If you’re often allocating memory blocks of a certain specific size, you can take advantage of that by pre-allocating a pool N times that size, and writing a special-purpose allocator (e.g. malloc64 to allocate 64-byte blocks) that allocates from that pool.

    In C++ this comes up a lot because there tends to be a lot of small object allocations (“new X” is usually a small object allocation of sizeof(X) bytes). If the default “operator new” is based on malloc, this can cause a performance problem because malloc is typically not optimized for many small-object allocations. A C++ programmer can work around this by overriding “operator new” for frequently-allocated classes, leveraging the above idea of providing an allocator specifically customized for allocating sizeof(X) bytes. Chapter 4 of this book (“Small Object Allocation” is the chapter, “Modern Concepts in C++” is the book) explains in detail how to do this: http://www.amazon.com/Modern-Design-Generic-Programming-Patterns/dp/0201704315/ref=sr_1_cc_1?s=aps&ie=UTF8&qid=1380631869&sr=1-1-catcorr&keywords=Modern+Concepts+in+C%2B%2B

    Rich Title

  • Jason Evans

    It’s not obvious to me which version of jemalloc was used in your benchmarks. Was it a modified copy of code taken from a release version of FreeBSD? Or was it a release version of jemalloc from http://www.canonware.com/jemalloc/ ? Similarly, exactly what versions of ptmalloc and tcmalloc did you use?

    • Florent Bruneau

      All the versions used are the packaged versions on debian sid, which means: jemalloc 3.4.0 (from http://www.canonware.com/jemalloc/), ptmalloc 2 (commes with glibc 2.17), tcmalloc 2.1.

  • Sam

    Hi,

    Did you used a software to generate the images of this page ? If yes, could you post a link to theirs webpages please ? (I’m particularly interested in the linked list one).

    Thanks