At Intersec we chose the C programming language because it gives us full control over what we’re doing and achieves a high level of performance. For many people, performance is just about using as few CPU instructions as possible. However, on modern hardware it’s much more complicated than just the CPU. Algorithms have to deal with memory, CPU, disk and network I/Os… Each of them adds to the cost of the algorithm and each of them must be properly understood in order to guarantee both the performance and the reliability of the algorithm.
The impact of the CPU (and as a consequence, of algorithmic complexity) on performance is well understood, as are disk and network latencies. Memory, however, seems much less understood. As our experience with our customers shows, even the output of widely used tools, such as top, is cryptic to most system administrators.
This post is the first in a series of five about memory. We will deal with topics such as the definition of memory, how it is managed, how to read the output of tools… This series will address subjects of interest to both developers and system administrators. While most rules should apply to most modern operating systems, we’ll talk more specifically about Linux and the C programming language.
From Virtual to Physical
In the previous article, we introduced a way to classify the memory a process claims. We used four quadrants built on two axes: private/shared and anonymous/file-backed. We also touched on the complexity of the sharing mechanism and the fact that all memory is ultimately requested from the kernel.
Everything we talked about was virtual. It was all about the reservation of memory addresses, but a reserved address is not always immediately mapped to physical memory by the kernel. Most of the time, the kernel delays the actual allocation of physical memory until the first access (or the first write in some cases)… and even then, this is done with the granularity of a page (commonly 4 KiB). Moreover, some pages may be swapped out after being allocated, meaning they get written to disk in order to allow other pages to be put in RAM.
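To make the delayed allocation concrete, here is a minimal sketch, assuming Linux and glibc. It reserves a few pages of private anonymous memory with mmap(), then asks the kernel through mincore() how many of them are resident before and after writing to some of them. The helper names count_resident_pages and demo_lazy_allocation are ours, chosen for illustration only.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS and mincore() on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the resident pages among the nb_pages pages starting at addr.
 * addr must be page-aligned. Returns -1 on error. */
static long count_resident_pages(void *addr, size_t nb_pages, size_t page_size)
{
    unsigned char *vec = malloc(nb_pages);
    long count = 0;

    if (!vec || mincore(addr, nb_pages * page_size, vec) < 0) {
        free(vec);
        return -1;
    }
    for (size_t i = 0; i < nb_pages; i++) {
        count += vec[i] & 1; /* bit 0 set: the page is in RAM */
    }
    free(vec);
    return count;
}

/* Hypothetical demo helper: reserve nb_pages of private anonymous memory,
 * write one byte into each of the first `touched` pages, and report how
 * many pages were resident before and after the writes.
 * Returns 0 on success, -1 on error. */
static int demo_lazy_allocation(size_t nb_pages, size_t touched,
                                long *before, long *after)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *mem;

    assert(nb_pages > 0 && touched <= nb_pages);
    mem = mmap(NULL, nb_pages * page_size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        return -1;
    }
    *before = count_resident_pages(mem, nb_pages, page_size);
    for (size_t i = 0; i < touched; i++) {
        mem[i * page_size] = 1; /* the first write triggers a page fault */
    }
    *after = count_resident_pages(mem, nb_pages, page_size);
    munmap(mem, nb_pages * page_size);
    return (*before < 0 || *after < 0) ? -1 : 0;
}
```

On a typical Linux system, calling demo_lazy_allocation(16, 4, &before, &after) should report no resident page before the writes and at least four afterwards, although the exact figures depend on the kernel configuration.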
As a consequence, knowing the actual amount of physical memory used by a process (known as the resident memory of the process) is really hard… and the sole component of the system that actually knows about it is the kernel (it’s even one of its jobs). Fortunately, the kernel exposes some interfaces that let you retrieve statistics about the system or about a specific process. This article digs into the tools provided by the Linux ecosystem to analyze the memory pattern of processes.
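As an illustration of such an interface, the sketch below reads the resident memory of the current process from /proc/self/statm, whose second field is the resident set size expressed in pages. This assumes a Linux system with /proc mounted; the function name get_resident_bytes is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Return the resident set size of the current process in bytes,
 * or -1 if /proc/self/statm cannot be read or parsed. */
static long get_resident_bytes(void)
{
    unsigned long size_pages, resident_pages;
    long page_size = sysconf(_SC_PAGESIZE);
    FILE *f;
    int nb_fields;

    assert(page_size > 0);
    f = fopen("/proc/self/statm", "r");
    if (!f) {
        return -1;
    }
    /* First field: total virtual size; second field: resident size.
     * Both are expressed in pages. */
    nb_fields = fscanf(f, "%lu %lu", &size_pages, &resident_pages);
    fclose(f);
    if (nb_fields != 2) {
        return -1;
    }
    return (long)resident_pages * page_size;
}
```

The value moves as the process touches or releases pages, so two consecutive calls may well return different results.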
Developer point of view
In the previous articles we dealt with memory classification and analysis from an outside point of view. We saw that memory can be allocated in different ways with various properties. In the remaining articles of the series we will take a developer’s point of view.
At Intersec we write all of our software in C, which means that we are constantly dealing with memory management. We want our developers to have a solid knowledge of the various existing memory pools. In this article we will give an overview of the main sources of memory available to C programmers on Linux. We will also see some rules of memory management that will help you keep your program correct and efficient.
malloc() is not the one-size-fits-all allocator
malloc() is extremely convenient because it is generic. It makes no assumptions about the context of an allocation and its deallocation. The two may follow each other immediately, or be separated by a whole job execution. They may take place in the same thread, or not… Because it is generic, it treats every allocation alike, meaning that long-term allocations share the same pool as short-term ones.
Consequently, the implementation of malloc() is complex. Since memory can be shared by several threads, the pool must be shared and locking is required. Since modern hardware has more and more physical threads, locking the pool at every single allocation would have a disastrous impact on performance. Therefore, modern malloc() implementations have thread-local caches and will lock the main pool only if the caches get too small or too large. A side effect is that some memory gets stuck in thread-local caches and is not easily accessible from other threads.
Since chunks of memory can get stuck at different locations (within thread-local caches, in the global pool, or simply allocated by the process), the heap gets fragmented. It becomes hard to release unused memory to the kernel, and it becomes highly probable that two successive allocations will return chunks of memory that are far from each other, generating random accesses to the heap. As we have seen in the previous article, random access is far from being the optimal way to access memory.
As a consequence, it is sometimes necessary to have specialized allocators with predictable behavior. At Intersec, we have several of them to use in various situations. In some specific use cases we increase performance by several orders of magnitude.
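To give an idea of what a specialized allocator with predictable behavior can look like, here is a minimal sketch of a bump (arena) allocator. It is not one of the Intersec allocators: the names arena_t, arena_init, arena_alloc, arena_reset and arena_wipe are hypothetical, and the sketch assumes single-threaded use and a fixed capacity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical fixed-size arena: allocations are carved sequentially out
 * of one contiguous block and are all released at once. */
typedef struct arena_t {
    unsigned char *base; /* start of the backing block */
    size_t size;         /* capacity in bytes */
    size_t used;         /* bytes handed out so far */
} arena_t;

/* Allocate the backing block. Returns 0 on success, -1 on failure. */
static int arena_init(arena_t *a, size_t size)
{
    a->base = malloc(size);
    a->size = a->base ? size : 0;
    a->used = 0;
    return a->base ? 0 : -1;
}

/* Return len bytes aligned for any object type, or NULL when the arena
 * is full. The cost is a few arithmetic operations and no locking. */
static void *arena_alloc(arena_t *a, size_t len)
{
    const size_t align = _Alignof(max_align_t);
    size_t start;

    assert(a->base);
    start = (a->used + align - 1) & ~(align - 1);
    if (start > a->size || len > a->size - start) {
        return NULL;
    }
    a->used = start + len;
    return a->base + start;
}

/* Release every allocation at once; the backing block is kept. */
static void arena_reset(arena_t *a)
{
    a->used = 0;
}

/* Give the backing block back to malloc(). */
static void arena_wipe(arena_t *a)
{
    free(a->base);
    a->base = NULL;
    a->size = a->used = 0;
}
```

Successive allocations are adjacent in memory, which avoids the random accesses described above. The trade-off is that individual chunks cannot be freed, so this pattern only fits allocations that share the same lifetime.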
Here we are! We spent four articles explaining what memory is, how to deal with it and what kinds of problems you can expect from it. Even the best developers write bugs. A commonly accepted estimate seems to be around a few tens of bugs per thousand lines of code, which is definitely quite huge. As a consequence, even if you have mastered all the concepts covered by our articles, you’ll still probably have a few memory-related bugs.
Memory-related bugs may be particularly hard to spot and fix. Let’s take the following program as an example:
#include <stdio.h>

#define MAX_LINE_SIZE 32

static const char *build_message(const char *name)
{
    char message[MAX_LINE_SIZE];

    sprintf(message, "hello %s!\n", name);
    return message;
}

int main(int argc, char *argv[])
{
    fputs(build_message(argc > 1 ? argv[1] : "world"), stdout);
    return 0;
}
This program is supposed to take a message as an argument and print “hello <message>!” (the default message being “world”).
The behavior of this program is completely undefined. It is buggy, yet it will probably not crash. The function build_message returns a pointer to some memory allocated in its stack frame. Because of how the stack works, that memory is very likely to be overwritten by a later function call, possibly by fputs. As a consequence, if fputs internally uses enough stack memory to overwrite the message, the output will be corrupted (and the program may even crash); otherwise the program will print the expected message. Moreover, the program may overflow its buffer because it uses the unsafe sprintf function, which puts no limit on the number of bytes written.
So, the behavior of the program varies depending on the size of the message given on the command line, the value of MAX_LINE_SIZE and the implementation of fputs. What’s annoying with this kind of bug is that the result may not be obvious: the program “works” well enough with simple use cases and will only fail the day it receives a parameter with the right properties to exhibit the issue. That’s why it’s important that developers are at ease with some tools that will help them validate (or debug) memory management.
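For reference, one possible way to remove both defects described above is to let the caller own the buffer and to bound the write with snprintf. This is only a sketch: the extra parameters of build_message and the choice to return NULL on truncation are our assumptions, not part of the original program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MAX_LINE_SIZE 32

/* Format the greeting into a caller-owned buffer of `size` bytes.
 * Returns buf on success, or NULL if the message did not fit (buf then
 * holds a truncated, NUL-terminated string). */
static const char *build_message(char *buf, size_t size, const char *name)
{
    int len;

    assert(buf && size > 0);
    len = snprintf(buf, size, "hello %s!\n", name);
    if (len < 0 || (size_t)len >= size) {
        return NULL;
    }
    return buf;
}
```

The caller would declare char message[MAX_LINE_SIZE] in main, pass it along with sizeof(message), and check the result before handing it to fputs. The buffer now outlives the call to build_message, and a name that is too long is reported instead of overflowing the stack.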
In this last article, we will cover some free tools that we consider should be part of the minimal toolkit of a C (and C++) developer.
The most used custom allocators at Intersec are the FIFO and the Stack allocators, detailed in a previous article. The stack allocator is extremely convenient, thanks to the t_scope macro, and the FIFO is well suited to some of our use cases, such as inter-process communication. It is thus important for these allocators to be optimized extensively.
We are two interns at Intersec, and our objective for this six-week internship was to optimize these allocators as far as possible. Optimizing an allocator can have several meanings: it can be in terms of memory overhead, resistance to contention, performance… As the FIFO allocator is designed to work in single-threaded environments, and the t_stack is thread-local, we will only cover performance and memory overhead.