Memory – Part 3: Managing memory

Developer point of view

In the previous articles we dealt with memory classification and analysis from an outer point of view. We saw that memory can be allocated in different ways with various properties. In the remaining articles of the series we will take a developer point of view.

At Intersec we write all of our software in C, which means that we are constantly dealing with memory management. We want our developers to have a solid knowledge of the various existing memory pools. In this article we will have an overview of the main sources of memory available to C programmers on Linux. We will also see some rules of memory management that will help you keep your program correct and efficient.

Locality

We talked a lot about memory pages which are the allocation unit in the kernel (and in the hardware). The CPU uses a thinner addressing unit: the cache line. A cache line is usually 64 bytes long. It’s the size the CPU will fetch from the main memory to its various internal caches.

Old CPUs had no cache, but CPU performances evolved more quickly that memory bus performances. As a consequence, to avoid spending too much time fetching data from the main memory, the CPU gained some small amounts of memory included directly in the chip. Initially there was a single small cache, but modern CPU may have up to three levels of cache. The closer the cache is to the CPU the faster it is. The farther the cache is from the CPU, the larger it gets. Here is a small table with the order of magnitude of the size and access time for each cache level on an i5-750 Core from 20101:

Level Size Expected CPU Cycles Observed CPU Cycles Access Time
Register A few words 0 0 0 ns
Cache L1d 32KiB 4 2 0.8 ns
Cache L2 256KiB 11 5 1.9 ns
Cache L3 8MiB 31 38 15 ns
Main Memory Several GiB 100+ 40+ ns
Rotating Disk Hundreds of GiB 10M+ 5M+ ns

In order to avoid losing CPU time fetching data from memory, you must try to reuse the memory that is already in a cache. This is the the locality principle.

We talk about spatial and temporal locality:

  • Spatial locality: organize your program so that variables accessed together are physically grouped.
  • Temporal locality: organize your code so that accesses to the same location are done close to each other.

Since the memory is fetched by cache lines, a good practice is to group variables that often get accessed together in the same cache line2. Some tools such as pahole let you inspect the layout of your structures:

CPUs also have access pattern detection primitives. They are used to prefetch memory if the CPU guesses it will be probably accessed in a short time. When prefetching is done appropriately, it avoids more latency since memory is already fetched when the CPU actually needs it.

You can find more details in the already cited What every programmer should know about memory.

Ownership

Memory management requires good habits. When a program deals with some memory, it must know which component is responsible for that memory and it should not access it once that component has decided the memory is not valid anymore.

This implies that, for each memory chunk, the developer must maintain two properties:

  • Ownership: who is responsible for that memory?
  • Lifetime: when does that memory get invalidated?

These properties may be dealt with in different ways. First, this can be done implicitly: some structures may always own the memory they point to. This implicit contract can be part of a coding convention or documented in a function or structure prototype.

Secondly, this can be done explicitly: the pointer can be associated with a flag (or another variable) indicating whether the pointer owns its memory or not.

This wiping function resets the variables on exit, which avoids keeping a pointer to some freed memory and thus ensure that this memory will not be accessed “by accident” in the future. We can still dereference the NULL pointer, but this will lead to an immediate error instead of risking corrupting the allocation pool. BTW, this ensures the wiping function is idempotent which makes memory cleanup code simpler.

A proper ownership tracking avoids issues such as use-after-free, double free, leaked memory, …

Memory pools

A quick vocabulary note: while in the literature pool and arena are often used as synonyms, in this and further articles, we will use these terms for two different concepts. Pools are conceptual sources of data, while arenas are large chunks of memory intended to be split into smaller chunks by a memory allocator.

To be able to track the lifetime of a particular memory chunk, developers have to know its originating memory pool. Several standard pools exist. The next paragraphs will detail the most important ones.

Static pool

Owner The runtime (dynamic loader)
Lifetime The same as the file that holds the data
Initialization Explicit value or 0

Static memory is memory allocated at program start by the runtime, or when a dynamic library is loaded. It contains the global variables, whatever their visibility is (extern, static, static in a function…), as well as all the constants (this includes literal strings and exported const variables). In C, a global variable is always initialized to 0 unless explicitly initialized. As a consequence, a large amount of the static memory is probably going to be initialized to 0 when the program starts.

The content of the static memory is extrapolated from the executable files. On Linux, most executables use the ELF format. That format is a succession of sections, each section having a particular purpose: holding code, data, or various meta-data about the binary (debug information, dynamic linking pointers, …). When the executable is loaded, those sections are selectively loaded in memory according to some flags. Among the few tens of standard sections, only a few will be of interest for this article.

First of all, the .text section. That section contains the main code of the binary file and is completed by a few other ones containing special-purposed executable code (such as the .init section that contains initialization code executed when the file is loaded, and the .fini section that contains termination code executed when the file is unloaded). Those sections are mapped in memory with the PROT_EXEC mode. If you recall the previous article, those sections are easily identifiable in pmap‘s output:

Then come the sections that hold some data. There are three of them: .data, .rodata and .bss. The first one contains the mutable variables with an explicit initialization value, the second one contains the constants (and as a consequence it is supposed to be read-only), and the last one is dedicated to uninitialized data. The exact repartition of the data among those three sections depends on your compiler. Most of the time we can observe that:

  • variables and constants initialized to 0 are put in the .bss section.
  • remaining variables are put in the .data section.
  • remaining constants are put in the .rodata section.

Usually, the .rodata is placed just after the executable sections since neither of them are writable. This allows the read-only data and the executable code to be mapped at once and thus simplifies the initialization process. This is mostly noticeable when the executable section is small.

In a similar fashion, since both .data and .bss are writable sections, the latter usually directly follows the former. .bss has the additional property of being sparse: since it only contains uninitialized data that will be filled with zeros, the size of the section is enough to define it. As a consequence, if needed3, the .bss section is built as a read-write anonymous mapping directly following the read-write mapping of the .data section:

The various sections have no reason to start or end exactly at a page limit. As a consequence, when the .bss section spreads over the end of the map made for the .data section, that memory is overridden with zeros in order to conform to the .bss initialization value4. The .bss section is then completed with the anonymous memory.

As you can see, the loading of binary files is not straightforward. An ELF executable is in fact loaded as several mappings corresponding to the various chunks of memory that must be made available at runtime.

Note that this description is not exhaustive and as you can see we didn’t cover all the sections mapped in memory.

Stack

Owner The runtime (the activation record)
Lifetime The scope in which it is declared
Initialization None (random content)

The stack is the place where functions put their contexts. It is composed of a succession of frames (also called activation records) each frame being the context of a different function.

Allocation

The stack is memory region allocated when a thread starts. The allocation of the stack of the main thread is a bit different from the additional ones. Using pmap you can identify the location of the main stack (it appears as [stack]), however the additional stacks are shown as [anon] (the information is however available in /proc/[pid]/maps as [stack:pid_of_the_thread]).

The main thread’s stack is dynamically managed by the kernel; the initial size of the map is not definitive, it will automatically grow as the stack grows. The additional stacks are allocated at thread initialization as a private anonymous map of the stacksize attribute of pthread_attr. The stacks have a limited size, however the default limit on recent Linux distributions is several mega-bytes, which can typically store several thousands of frames. That maximum size is defined as a resource limit and can be manipulated at different levels on the system: through the /etc/limits.conf file, using the ulimit command-line utility or by calling setrlimit().

When the content of the stack does not fit in its allowed size, the process crashes with a segmentation fault due to a stack overflow. In the case of the main thread, this is dynamically handled by the kernel, while additional threads’ stacks are preceded5 by a small no-access region, so that any access there is illegal and generates the segmentation fault. As a consequence, a deeply recursive function can cause a stack overflow, since each recursion creates a new frame.

Each time a function is called, a new frame is pushed by putting the parameters and various state information on the top of the stack. When the function returns, its frame is popped and the stack is set back to its previous state. The exact size and content of a frame depend on the ABI and thus on the operating system, the architecture of the computer and even on the compilation options. However, it generally grows with the number of parameters and the number of local variables of the callee.

Traditionally, the size of the frame was statically defined by the compiler. However, dynamic allocations are allowed in a non-standard fashion using the alloca(3) call. Each call to alloca extends the current frame. There is no call to release or resize an alloca allocation. Since the size of the stack is limited, alloca must be handled with care to avoid stack overflows6. Putting a call to alloca in a loop or having a call to alloca with a dynamically defined size (that may be large) are bad ideas.

Lifetime

A common (wrong) belief is that each local variable of a function has its own memory slot allocated on the stack. However, even if this may be true when the program is compiled with no optimization, this is usually wrong. The compiler job is to ensure that the resulting process is as fast as possible. As a consequence, it will try to keep the data in the registers of the CPU whenever possible in order to reduce the access latency, but since the number of registers is limited, when there are too many variables (or when we need a pointer to one of them), some must be put on the stack. The compiler analyzes the accesses to each variable and perform a register and stack allocation pass based on the result.

The result of that pass is that two variables that are never used at the same time may share the same register and/or the same location on the stack. A variable may also not always be assigned to the same register or memory location. Thus you must keep in mind that even if the memory on the stack remains valid as long as the frame is active (that is as long as the function didn’t return), a particular memory location can be used by several local variables belonging to different scopes:

As a consequence, a pointer to a variable placed on the stack must not exit the lexical scope of that particular variable and a fortiori it must not be propagated as a returned value of the function. The compiler is able to report some broken usage of pointers to stacked variables, but that’s only a small part of the possible errors. So you must be very careful when using pointers to variables on the stack, and especially ensure that the pointers never escape (that is, they never get returned or stored in a global variables or other variables with a longer lifetime) their validity scopes.

Heap

Owner The caller of brk()
Lifetime Until another brk() invalidates the memory
Initialization 0

The heap is a special memory area that has a fixed location and can be grown. It is managed using the brk() and sbrk() standard calls. Those calls let you add or remove memory at the end of the heap with the granularity of a page. Traditionally these calls were supposed to grow the data segment (that is increasing the map corresponding to the .bss section). However, this was at a time where memory management was much less advanced and memory was much more constrained. Nowadays, the virtual address space is large and there is no need to keep it packed anymore.

As you can see in the content of /proc/[pid]/maps, on modern operating systems, the [heap] is still locating near the beginning of the memory address space, after the mapping of the executable (and thus after the .bss section), but there is a large gap between the .bss section and the heap:

As for the stack, the size of the heap is subject to a resource limitation.

In practice, you will never have to manage the heap manually. It is used internally by most malloc() implementations.

mmap()

Owner The caller
Lifetime Until corresponding munmaps
Initialization File content or 0

The mmap system call is the primitive to interact with the virtual memory manager of the kernel. It reserves any of the categories of memory we defined in the first article with the granularity of the page. That granularity does not forbid you to map a file which size is not a multiple of the page size, the remaining area is the last page will simply be filled with zeros (which won’t be added to disk)7.

When calling mmap you manipulate the properties of an internal lookup structure of the kernel, you don’t actually allocate a chunk of memory. As a consequence running mmap several times on the same page (which is possible using the MAP_FIXED flag) won’t create a new map, it will just change the attributes of the pages at the specified address. munmap, the mmap counterpart, removes a particular set of pages from the kernel lookup structure. There is no need to call mmap and munmap in a symetric fashion: you can both unmap several maps with a single large munmap or unmap a single map with several smaller munmap.

A small example to illustrate this all:

malloc()

Owner The caller
Lifetime Until corresponding free()
Initialization None (random content)

malloc() is probably the most widely known source of memory in C. It is the standard dynamic allocator with its siblings calloc(), realloc(), memalign(), free(), … It works by requesting big chunks of memory from the kernel, splitting them into smaller ones that satisfy the size requested by the caller as well as limiting the overhead in both memory and CPU.

The Linux ecosystem is full of different implementations of malloc(). The most notable implementations are probably the ptmalloc family and tcmalloc from Google. The glibc uses the ptmalloc2 implementation which itself is an adaptation of Doug Lea’s malloc (dlmalloc). That version uses a somehow complex algorithm which varies depending on the size of the requested allocation. It uses two sources of memory. For small allocations, it uses the heap while for big ones, it will fallback on mmap(). The default threshold is dynamically estimated and is by default between 128KiB and 64MiB. This can be changed using the options M_MMAP_THRESHOLD and M_MMAP_MAX of the mallopt() call.

The allocator splits the heap into some lists of non-allocated memory chunks. When an allocation is requested, it traverses its lists looking for the more suitable chunk. When no chunk can be found, it grows the heap by issuing a brk() system call and put the additional memory in its internal lists.

When a chunk of memory is free’d (by calling free() on the pointer previously returned by the malloc() call), it is added in a free-chunk list. When the last pages of the heap are free, those pages are released and the heap is reduced (by issuing another brk() call). As a consequence, calling free() in a program may not have a direct impact of the actual memory used by the process. If the program performs a lot of allocation / deallocation with various lifetime, the heap may be fragmented: the heap cannot be reduced even if it contains a lot of free memory. Nevertheless, calling free() on a big chunk of memory (big enough for being allocated using the fallback on mmap()) will call munmap and thus will have a direct impact of the memory consumption of your process.

ptmalloc2 has a memory overhead of at least 8 bytes per allocation on 64-bit system. These bytes are used internally to store some meta-data about the allocation. This means it is extremely memory-inefficient for very small allocations (but this is the case for most malloc() implementations). Those administrative bytes are stored just before the returned pointer, which means they can easily be inadvertently overwritten by a memory underflow. The consequence of such an overwrite is a corruption of the allocator that will later lead to a segmentation fault ((Later means that the crash will seem totally random on a first sight)).

Nowadays, a challenge of a good malloc() implementation is to behave properly in a multithreaded environment. Most of the time, this is done by having thread-local caches and a shared heap. The allocator will first try to find a matching chunk in the local cache and fallback to the heap (which requires locking) only if no matching element can be found.

Next: Intersec’s custom allocators

After spending three full (and quite long) articles to lay the foundation for memory understanding, in the next post we will cover some of the specific implementations we chose to develop at Intersec to match our own requirements.


  1. the size of the cache can be found in the CPU specification, the access times have been computed using a small program that traverses randomly a memory range on various memory sizes. Gaps clearly appear in the resulting data-set when the size of the memory range crosses the size of a cache. 

  2. You should also avoid mixing in the same cache line variables that are mostly read-only with variables that are mostly written in order to avoid  cache contention in concurrent programming

  3. If it is really small and fit in the mapping of the .data section, it just won’t have a dedicated mapping 

  4. since the map is private, the zeros written there are not visible to other processes 

  5. The frames are stacked from the end to the beginning of the memory region. 

  6. Coding conventions will often forbid alloca() because it is both dangerous and non-portable. 

  7. Note that if for some reason you reduce the size of an already mmaped file (using truncate()) without reducing the size of the map accordingly, accessing the the part of the mapping that has no underlying storage anymore will cause a SIGBUS. This, as far as we can tell, is the most common situation that can lead to a SIGBUS on Linux. 

  • Nathan Kurz

    Great overview! Resubmitted to HN with a different title to see if it can get more of the attention it deserves. I hadn’t known about pahole, sounds useful. The only suggestion I have is that it might be good to also refer to Doug Lea’s Malloc as ‘dlmalloc’ so people make the association when they see it. Eager to see your follow up. In case you want to write about it, one thing that has always confused me is why there aren’t many attempts to remap physical pages to create larger contiguous regions for large reallocs. Or is this what brk() is actually doing?

    • Florent Bruneau

      Hi Nathan,

      For large mallocs (those that fallback on a mmap), the glibc’s implementation of realloc() uses the mremap system call when available (this call is non-standard, it is present on Linux and and certain flavors of BSD). That system call guarantees a copy-less reallocation (it will just change the physical < -> virtual page mapping).

  • Pingback: Intersec内存系列-第1部分:了解进程的内存 | IDF实验室·博译有道