malloc() is not the one-size-fits-all allocator
malloc() is extremely convenient because it is generic. It makes no assumptions about the context of an allocation and its deallocation: the two may immediately follow each other or be separated by a whole job execution, and they may take place in the same thread or in different ones. Because it is generic, malloc() cannot treat allocations differently according to their lifetime, so long-term allocations share the same pool as short-term ones.
Consequently, the implementation of malloc() is complex. Since memory can be shared by several threads, the pool must be shared and locking is required. Since modern hardware has more and more hardware threads, locking the pool at every single allocation would have a disastrous impact on performance. Therefore, modern malloc() implementations have thread-local caches and lock the main pool only when a cache gets too small or too large. A side effect is that some memory gets stuck in thread-local caches and is not easily accessible from other threads.
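To make the idea of a thread-local cache concrete, here is a minimal sketch in C. It is not how any particular malloc() is implemented: it handles a single block size only, and the names cache_alloc(), cache_free(), BLOCK_SIZE and CACHE_MAX are hypothetical, chosen for illustration. It only shows the principle that the common path touches thread-local state without locking, while the shared pool is locked when the cache is empty or full.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_SIZE 64  /* hypothetical: this sketch serves one block size only */
#define CACHE_MAX  32  /* hypothetical: max blocks kept in a thread's cache */

/* A free block stores the link to the next free block in its own memory. */
typedef struct block { struct block *next; } block_t;

/* Shared pool of free blocks, protected by a lock. */
static block_t *global_pool;
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Per-thread cache of free blocks, accessed without any lock. */
static _Thread_local block_t *tl_cache;
static _Thread_local size_t tl_count;

void *cache_alloc(void)
{
    block_t *b = tl_cache;

    if (b) {
        /* Fast path: serve the request from the thread-local cache. */
        tl_cache = b->next;
        tl_count--;
        return b;
    }

    /* Slow path: the cache is empty, so the shared pool must be locked. */
    pthread_mutex_lock(&global_lock);
    b = global_pool;
    if (b) {
        global_pool = b->next;
    }
    pthread_mutex_unlock(&global_lock);

    /* Fall back to the generic allocator when the pool is empty too. */
    return b ? (void *)b : malloc(BLOCK_SIZE);
}

void cache_free(void *p)
{
    block_t *b = p;

    assert(p != NULL);
    if (tl_count < CACHE_MAX) {
        /* Fast path: keep the block in this thread's cache. */
        b->next = tl_cache;
        tl_cache = b;
        tl_count++;
        return;
    }

    /* Slow path: the cache is full, return the block to the shared pool. */
    pthread_mutex_lock(&global_lock);
    b->next = global_pool;
    global_pool = b;
    pthread_mutex_unlock(&global_lock);
}
```

Note how a block released by one thread stays in that thread's cache until the cache overflows: this is the memory that "gets stuck" and that other threads cannot reuse.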
Since chunks of memory can get stuck at different locations (within thread-local caches, in the global pool, or simply allocated by the process), the heap gets fragmented. It becomes hard to release unused memory to the kernel, and it becomes highly probable that two successive allocations will return chunks of memory that are far from each other, generating random accesses to the heap. As we have seen in the previous article, random access is far from being the optimal way to access memory.
As a consequence, it is sometimes necessary to have specialized allocators with predictable behavior. At Intersec, we have several of them to use in various situations. In some specific use cases, they increase performance by several orders of magnitude.
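As an illustration of what "specialized" and "predictable" can mean, here is a minimal sketch of an arena (bump) allocator in C. It is a generic textbook technique, not one of the Intersec allocators, and the names arena_t, arena_init(), arena_alloc(), arena_reset() and arena_wipe() are hypothetical. It assumes a single thread and allocations that are all released together.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One contiguous buffer; allocations are carved out of it in order. */
typedef struct arena {
    unsigned char *base;  /* start of the buffer */
    size_t cap;           /* size of the buffer in bytes */
    size_t used;          /* bytes handed out so far */
} arena_t;

/* Reserve the buffer once. Returns 0 on success, -1 on failure. */
int arena_init(arena_t *a, size_t cap)
{
    a->base = malloc(cap);
    a->cap = a->base ? cap : 0;
    a->used = 0;
    return a->base ? 0 : -1;
}

/* Allocate by bumping an offset: no lock, no free list, no search. */
void *arena_alloc(arena_t *a, size_t size)
{
    const size_t align = _Alignof(max_align_t);
    size_t start;

    assert(a != NULL);
    /* Round the current offset up to the strictest fundamental alignment. */
    start = (a->used + align - 1) & ~(align - 1);
    if (start > a->cap || size > a->cap - start) {
        return NULL;  /* arena exhausted */
    }
    a->used = start + size;
    return a->base + start;
}

/* Release every allocation at once; the buffer is kept for reuse. */
void arena_reset(arena_t *a)
{
    a->used = 0;
}

/* Give the buffer back to the generic allocator. */
void arena_wipe(arena_t *a)
{
    free(a->base);
    a->base = NULL;
    a->cap = 0;
    a->used = 0;
}
```

Two successive calls to arena_alloc() return adjacent chunks, so the access pattern stays sequential, and releasing everything is a single assignment. The price is the loss of genericity: individual chunks cannot be freed, and the arena must not be shared between threads without external locking.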