Memory - Part 3: Managing memory
Developer point of view
In the previous articles we dealt with memory classification and analysis from an outer point of view. We saw that memory can be allocated in different ways with various properties. In the remaining articles of the series we will take a developer point of view.
At Intersec we write all of our software in C, which means that we are constantly dealing with memory management. We want our developers to have a solid knowledge of the various existing memory pools. In this article we will have an overview of the main sources of memory available to C programmers on Linux. We will also see some rules of memory management that will help you keep your program correct and efficient.
Locality
We talked a lot about memory pages, which are the allocation unit in the kernel (and in the hardware). The CPU uses a finer access unit: the cache line. A cache line is usually 64 bytes long. It is the amount of data the CPU fetches from the main memory into its various internal caches.
Old CPUs had no cache, but CPU performance evolved more quickly than memory bus performance. As a consequence, to avoid spending too much time fetching data from the main memory, CPUs gained small amounts of memory embedded directly in the chip. Initially there was a single small cache, but modern CPUs may have up to three levels of cache. The closer a cache is to the CPU, the faster it is; the farther it is, the larger it gets. Here is a small table with the order of magnitude of the size and access time for each cache level on a Core i5-750 from 2010 1:
Level | Size | Expected CPU Cycles | Observed CPU Cycles | Access Time |
---|---|---|---|---|
Register | A few words | 0 | 0 | 0 ns |
Cache L1d | 32KiB | 4 | 2 | 0.8 ns |
Cache L2 | 256KiB | 11 | 5 | 1.9 ns |
Cache L3 | 8MiB | 31 | 38 | 15 ns |
Main Memory | Several GiB | 100+ | | 40+ ns |
Rotating Disk | Hundreds of GiB | 10M+ | | 5M+ ns |
In order to avoid losing CPU time fetching data from memory, you must try to reuse the memory that is already in a cache. This is the locality principle.
We talk about spatial and temporal locality:
- Spatial locality: organize your program so that variables accessed together are physically grouped.
- Temporal locality: organize your code so that accesses to the same location are done close to each other.
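To illustrate spatial locality, here is a minimal sketch (a generic illustration, not code from the article): two traversal orders of the same matrix that compute the same sum but use the cache very differently.

```c
#include <stddef.h>

#define N 512

/* Row-major traversal: consecutive iterations read adjacent
 * addresses, so each 64-byte cache line is fully used before the
 * next one is fetched. */
long sum_by_rows(int m[N][N])
{
    long sum = 0;

    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            sum += m[i][j];
        }
    }
    return sum;
}

/* Column-major traversal: consecutive iterations jump
 * N * sizeof(int) bytes apart, touching a new cache line almost
 * every time. Same result, typically much slower on large matrices. */
long sum_by_cols(int m[N][N])
{
    long sum = 0;

    for (size_t j = 0; j < N; j++) {
        for (size_t i = 0; i < N; i++) {
            sum += m[i][j];
        }
    }
    return sum;
}
```

On matrices much larger than the L2 cache, the row-major version is usually several times faster, purely because of cache-line reuse.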
Since the memory is fetched by cache lines, a good practice is to group variables that often get accessed together in the same cache line 2. Some tools such as pahole let you inspect the layout of your structures:
struct mem_stack_pool_t {
    const void *        start;      /*   0    8 */
    size_t              size;       /*   8    8 */
    dlist_t             blk_list;   /*  16   16 */
    mem_stack_frame_t   base;       /*  32   32 */
    /* --- cacheline 1 boundary (64 bytes) --- */
    mem_stack_frame_t * stack;      /*  64    8 */
    size_t              stacksize;  /*  72    8 */
    uint32_t            minsize;    /*  80    4 */
    uint32_t            nbpages;    /*  84    4 */
    uint32_t            nbpops;     /*  88    4 */
    uint32_t            alloc_nb;   /*  92    4 */
    size_t              alloc_sz;   /*  96    8 */
    mem_pool_t          funcs;      /* 104   24 */
    /* --- cacheline 2 boundary (128 bytes) --- */

    /* size: 128, cachelines: 2, members: 12 */
};
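The offsets reported by pahole also show why member ordering matters: alignment padding. Here is a hypothetical pair of structures (not from our codebase; the sizes in the comments assume a typical 64-bit ABI) demonstrating how reordering members removes holes.

```c
#include <stdint.h>

/* Interleaving 4-byte and 8-byte members forces the compiler to
 * insert 4 bytes of padding before each uint64_t. */
struct padded_t {
    uint32_t a;  /*  0  4 */
                 /*  4  4 (hole) */
    uint64_t b;  /*  8  8 */
    uint32_t c;  /* 16  4 */
                 /* 20  4 (hole) */
    uint64_t d;  /* 24  8 */
};               /* size: 32 */

/* Sorting members by decreasing alignment removes the holes. */
struct compact_t {
    uint64_t b;  /*  0  8 */
    uint64_t d;  /*  8  8 */
    uint32_t a;  /* 16  4 */
    uint32_t c;  /* 20  4 */
};               /* size: 24 */
```

Beyond saving memory, a smaller structure has better odds of fitting in a single cache line.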
CPUs also have access pattern detection primitives. They are used to prefetch memory the CPU guesses will probably be accessed in a short time. When prefetching is done appropriately, it hides memory latency since the data is already fetched when the CPU actually needs it.
You can find more details in the already cited What every programmer should know about memory.
Ownership
Memory management requires good habits. When a program deals with some memory, it must know which component is responsible for that memory, and it must not access it once that component has decided the memory is no longer valid.
This implies that, for each memory chunk, the developer must maintain two properties:
- Ownership: who is responsible for that memory?
- Lifetime: when does that memory get invalidated?
These properties may be dealt with in different ways. First, this can be done implicitly: some structures may always own the memory they point to. This implicit contract can be part of a coding convention or documented in a function or structure prototype.
struct im_ptr_t {
    /* This pointer is owned by the structure */
    void *ptr;
};

void im_ptr_wipe(struct im_ptr_t *ptr)
{
    free(ptr->ptr);
    ptr->ptr = NULL;
}
Secondly, this can be done explicitly: the pointer can be associated with a flag (or another variable) indicating whether the pointer owns its memory or not.
struct ex_ptr_t {
    bool own;
    void *ptr;
};

void ex_ptr_wipe(struct ex_ptr_t *ptr)
{
    if (ptr->own) {
        free(ptr->ptr);
    }
    ptr->own = false;
    ptr->ptr = NULL;
}
This wiping function resets the variables on exit, which avoids keeping a pointer to some freed memory and thus ensures that this memory will not be accessed “by accident” in the future. The NULL pointer can still be dereferenced, but this leads to an immediate error instead of risking a corruption of the allocation pool. Incidentally, this also makes the wiping function idempotent, which simplifies memory cleanup code.
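To make the explicit contract concrete, here is a hedged usage sketch (the init helpers are hypothetical, not from our codebase): one structure owns its buffer and must free it, another only borrows it.

```c
#include <stdbool.h>
#include <stdlib.h>

struct ex_ptr_t {
    bool own;
    void *ptr;
};

/* Owning initialization: the structure is responsible for
 * freeing the memory on wipe. */
void ex_ptr_init_own(struct ex_ptr_t *p, size_t len)
{
    p->own = true;
    p->ptr = malloc(len);
}

/* Borrowing initialization: the memory belongs to someone else,
 * so wiping must not free it. */
void ex_ptr_init_borrow(struct ex_ptr_t *p, void *mem)
{
    p->own = false;
    p->ptr = mem;
}

void ex_ptr_wipe(struct ex_ptr_t *p)
{
    if (p->own) {
        free(p->ptr);
    }
    p->own = false;
    p->ptr = NULL;
}
```

Wiping the borrower is a no-op on the memory itself; only the owner's wipe releases it, and wiping twice is harmless.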
Proper ownership tracking avoids issues such as use-after-free, double free, memory leaks, …
Memory pools
A quick vocabulary note: while in the literature pool and arena are often used as synonyms, in this and further articles we will use these terms for two different concepts. Pools are conceptual sources of memory, while arenas are large chunks of memory intended to be split into smaller chunks by a memory allocator.
To be able to track the lifetime of a particular memory chunk, developers have to know its originating memory pool. Several standard pools exist. The next paragraphs will detail the most important ones.
Static pool
Owner | The runtime (dynamic loader) |
---|---|
Lifetime | The same as the file that holds the data |
Initialization | Explicit value or 0 |
Static memory is memory allocated at program start by the runtime, or when a dynamic library is loaded. It contains the global variables, whatever their visibility is (extern, static, static in a function…), as well as all the constants (this includes literal strings and exported const variables).

In C, a global variable is always initialized to 0 unless explicitly initialized. As a consequence, a large amount of the static memory is probably going to be initialized to 0 when the program starts.
The content of the static memory comes from the executable files. On Linux, most executables use the ELF format. That format is a succession of sections, each section having a particular purpose: holding code, data, or various meta-data about the binary (debug information, dynamic linking pointers, …). When the executable is loaded, those sections are selectively loaded in memory according to some flags. Among the few tens of standard sections, only a few will be of interest for this article.
First of all, the .text section. That section contains the main code of the binary file and is complemented by a few other ones containing special-purpose executable code (such as the .init section that contains initialization code executed when the file is loaded, and the .fini section that contains termination code executed when the file is unloaded). Those sections are mapped in memory with the PROT_EXEC mode. If you recall the previous article, those sections are easily identifiable in pmap's output:
3009: ./blah
0000000000400000 4K r-x-- /home/fruneau/blah
0000000000401000 4K rw--- /home/fruneau/blah
00007fbb5da87000 51200K rw-s- /dev/zero (deleted)
00007fbb60c87000 1536K r-x-- /lib/x86_64-linux-gnu/libc-2.13.so
00007fbb60e07000 2048K ----- /lib/x86_64-linux-gnu/libc-2.13.so
00007fbb61007000 16K r---- /lib/x86_64-linux-gnu/libc-2.13.so
00007fbb6100b000 4K rw--- /lib/x86_64-linux-gnu/libc-2.13.so
00007fbb6100c000 20K rw--- [ anon ]
00007fbb61011000 128K r-x-- /lib/x86_64-linux-gnu/ld-2.13.so
00007fbb61221000 12K rw--- [ anon ]
00007fbb6122e000 8K rw--- [ anon ]
00007fbb61230000 4K r---- /lib/x86_64-linux-gnu/ld-2.13.so
00007fbb61231000 4K rw--- /lib/x86_64-linux-gnu/ld-2.13.so
00007fbb61232000 4K rw--- [ anon ]
00007fff9350f000 132K rw--- [ stack ]
00007fff9356e000 4K r-x-- [ anon ]
ffffffffff600000 4K r-x-- [ anon ]
total 55132K
Then come the sections that hold data. There are three of them: .data, .rodata and .bss. The first one contains the mutable variables with an explicit initialization value, the second one contains the constants (and as a consequence it is supposed to be read-only), and the last one is dedicated to uninitialized data. The exact repartition of the data among those three sections depends on your compiler. Most of the time we can observe that:
- variables and constants initialized to 0 are put in the .bss section,
- remaining variables are put in the .data section,
- remaining constants are put in the .rodata section.
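You can check this repartition yourself: compile a file like the sketch below with `cc -c` and inspect it with `nm`, where B/b marks .bss symbols, D/d .data symbols and R/r .rodata symbols (the variable names here are of course made up).

```c
/* section_demo.c -- hypothetical file to inspect with nm(1) */
int bss_var;                 /* uninitialized: .bss (nm: B) */
int zero_var = 0;            /* initialized to 0: usually .bss as well */
int data_var = 42;           /* initialized variable: .data (nm: D) */
const int ro_var = 42;       /* exported constant: .rodata (nm: R) */
const char *msg = "hello";   /* msg itself is a mutable pointer (.data);
                              * the literal "hello" lives in .rodata */
```

Running `cc -c section_demo.c && nm section_demo.o` should show each symbol with the section letter given in the comments, modulo compiler-specific choices.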
Usually, the .rodata section is placed just after the executable sections since neither of them is writable. This allows the read-only data and the executable code to be mapped at once and thus simplifies the initialization process. This is mostly noticeable when the executable section is small:
3009: ./blah
0000000000400000 4K r-x-- /home/fruneau/blah
0000000000401000 4K rw--- /home/fruneau/blah
00007fbb5da87000 51200K rw-s- /dev/zero (deleted)
00007fbb60c87000 1536K r-x-- /lib/x86_64-linux-gnu/libc-2.13.so
00007fbb60e07000 2048K ----- /lib/x86_64-linux-gnu/libc-2.13.so
00007fbb61007000 16K r---- /lib/x86_64-linux-gnu/libc-2.13.so
00007fbb6100b000 4K rw--- /lib/x86_64-linux-gnu/libc-2.13.so
00007fbb6100c000 20K rw--- [ anon ]
00007fbb61011000 128K r-x-- /lib/x86_64-linux-gnu/ld-2.13.so
00007fbb61221000 12K rw--- [ anon ]
00007fbb6122e000 8K rw--- [ anon ]
00007fbb61230000 4K r---- /lib/x86_64-linux-gnu/ld-2.13.so
00007fbb61231000 4K rw--- /lib/x86_64-linux-gnu/ld-2.13.so
00007fbb61232000 4K rw--- [ anon ]
00007fff9350f000 132K rw--- [ stack ]
00007fff9356e000 4K r-x-- [ anon ]
ffffffffff600000 4K r-x-- [ anon ]
total 55132K
In a similar fashion, since both .data and .bss are writable sections, the latter usually directly follows the former. .bss has the additional property of being sparse: since it only contains uninitialized data that will be filled with zeros, the size of the section is enough to define it. As a consequence, if needed3, the .bss section is built as a read-write anonymous mapping directly following the read-write mapping of the .data section:
3009: ./blah
0000000000400000 4K r-x-- /home/fruneau/blah
0000000000401000 4K rw--- /home/fruneau/blah
00007fbb5da87000 51200K rw-s- /dev/zero (deleted)
00007fbb60c87000 1536K r-x-- /lib/x86_64-linux-gnu/libc-2.13.so
00007fbb60e07000 2048K ----- /lib/x86_64-linux-gnu/libc-2.13.so
00007fbb61007000 16K r---- /lib/x86_64-linux-gnu/libc-2.13.so
00007fbb6100b000 4K rw--- /lib/x86_64-linux-gnu/libc-2.13.so
00007fbb6100c000 20K rw--- [ anon ]
00007fbb61011000 128K r-x-- /lib/x86_64-linux-gnu/ld-2.13.so
00007fbb61221000 12K rw--- [ anon ]
00007fbb6122e000 8K rw--- [ anon ]
00007fbb61230000 4K r---- /lib/x86_64-linux-gnu/ld-2.13.so
00007fbb61231000 4K rw--- /lib/x86_64-linux-gnu/ld-2.13.so
00007fbb61232000 4K rw--- [ anon ]
00007fff9350f000 132K rw--- [ stack ]
00007fff9356e000 4K r-x-- [ anon ]
ffffffffff600000 4K r-x-- [ anon ]
total 55132K
The various sections have no reason to start or end exactly at a page boundary. As a consequence, when the .bss section spreads over the end of the map made for the .data section, that memory is overwritten with zeros in order to conform to the .bss initialization value4. The .bss section is then completed with the anonymous memory.
As you can see, the loading of binary files is not straightforward. An ELF executable is in fact loaded as several mappings corresponding to the various chunks of memory that must be made available at runtime.
Note that this description is not exhaustive and as you can see we didn’t cover all the sections mapped in memory.
Stack
Owner | The runtime (the activation record) |
---|---|
Lifetime | The scope in which it is declared |
Initialization | None (random content) |
The stack is the place where functions put their contexts. It is composed of a succession of frames (also called activation records), each frame being the context of a different function call.
Allocation
The stack is a memory region allocated when a thread starts. The allocation of the main thread's stack is a bit different from that of the additional ones. Using pmap you can identify the location of the main stack (it appears as [stack]); the additional stacks, however, are shown as [ anon ] (the information is nevertheless available in /proc/[pid]/maps as [stack:pid_of_the_thread]).
The main thread's stack is dynamically managed by the kernel: the initial size of the map is not definitive, it will automatically grow as the stack grows. The additional stacks are allocated at thread initialization as a private anonymous map whose size is the stacksize attribute of pthread_attr.
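Setting that stacksize attribute explicitly looks roughly like this (a minimal sketch with most error handling elided; the helper name is made up):

```c
#include <pthread.h>

/* Request a 1MiB stack, instead of the distribution default
 * (usually several MiB), for any thread created with this
 * attribute. Returns 0 on success. */
int make_small_stack_attr(pthread_attr_t *attr)
{
    if (pthread_attr_init(attr) != 0) {
        return -1;
    }
    /* PTHREAD_STACK_MIN is the absolute lower bound; 1MiB is a
     * comfortable size for non-recursive worker threads. */
    return pthread_attr_setstacksize(attr, 1 << 20);
}
```

The attribute is then passed as the second argument of pthread_create(), and the resulting thread's stack shows up as a 1MiB private anonymous map.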
Stacks have a limited size; however, the default limit on recent Linux distributions is several megabytes, which can typically store several thousand frames. That maximum size is defined as a resource limit and can be manipulated at different levels on the system: through the /etc/security/limits.conf file, using the ulimit command-line utility, or by calling setrlimit().
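Querying that limit programmatically, the counterpart of `ulimit -s`, can be sketched as follows (the helper name is made up):

```c
#include <sys/resource.h>

/* Return the soft stack limit in KiB, 0 for "unlimited",
 * or -1 on error. */
long stack_limit_kib(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) < 0) {
        return -1;
    }
    if (rl.rlim_cur == RLIM_INFINITY) {
        return 0;
    }
    return (long)(rl.rlim_cur / 1024);
}
```

setrlimit() takes the same struct rlimit and lets a process lower its own soft limit (or raise it up to the hard limit).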
00400000-00638000 r-xp 00000000 fe:02 507579 /home/fruneau/zchk
00838000-0084a000 rw-p 00238000 fe:02 507579 /home/fruneau/zchk
0084a000-0085c000 rw-p 00000000 00:00 0
02431000-02866000 rw-p 00000000 00:00 0 [heap]
[...]
7f5c7e8bd000-7f5c7e8be000 ---p 00000000 00:00 0
7f5c7e8be000-7f5c7f0be000 rw-p 00000000 00:00 0 [stack:711]
[...]
7f5c800c0000-7f5c801bd000 r-xp 00000000 08:06 558238 /lib/x86_64-linux-gnu/libm-2.17.so
7f5c801bd000-7f5c803bc000 ---p 000fd000 08:06 558238 /lib/x86_64-linux-gnu/libm-2.17.so
[...]
7f5c817d3000-7f5c817d4000 r--p 00021000 08:06 561168 /lib/x86_64-linux-gnu/ld-2.17.so
7f5c817d4000-7f5c817d5000 rw-p 00022000 08:06 561168 /lib/x86_64-linux-gnu/ld-2.17.so
7f5c817d5000-7f5c817d6000 rw-p 00000000 00:00 0
7fff985e7000-7fff98608000 rw-p 00000000 00:00 0 [stack]
7fff986c5000-7fff986c7000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
When the content of the stack does not fit in its allowed size, the process crashes with a segmentation fault due to a stack overflow. In the case of the main thread, this is dynamically handled by the kernel, while additional threads’ stacks are preceded5 by a small no-access region, so that any access there is illegal and generates the segmentation fault. As a consequence, a deeply recursive function can cause a stack overflow, since each recursion creates a new frame.
Each time a function is called, a new frame is pushed by putting the parameters and various state information on the top of the stack. When the function returns, its frame is popped and the stack is set back to its previous state. The exact size and content of a frame depend on the ABI and thus on the operating system, the architecture of the computer and even on the compilation options. However, it generally grows with the number of parameters and the number of local variables of the callee.
Traditionally, the size of the frame was statically defined by the compiler. However, dynamic allocations are allowed in a non-standard fashion using the alloca(3) call. Each call to alloca extends the current frame. There is no call to release or resize an alloca allocation. Since the size of the stack is limited, alloca must be handled with care to avoid stack overflows6: putting a call to alloca in a loop, or calling alloca with a dynamically defined size (that may be large), are bad ideas.
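A hedged sketch of the difference between a reasonable and a dangerous use of alloca (the helper names are made up):

```c
#include <alloca.h>
#include <string.h>

/* Reasonable: the size is a small compile-time constant, and the
 * memory vanishes automatically when the function returns. */
size_t copy_small(const char *s)
{
    char *copy = alloca(64);

    strncpy(copy, s, 63);
    copy[63] = '\0';
    return strlen(copy);
}

/* Dangerous: the size is caller-controlled; a large len can
 * silently blow the stack, with no way to detect the failure
 * before the crash. */
size_t copy_any(const char *s, size_t len)
{
    char *copy = alloca(len + 1);

    memcpy(copy, s, len);
    copy[len] = '\0';
    return strlen(copy);
}
```

Unlike malloc(), alloca never returns NULL on failure: overflowing the stack simply crashes the process (or worse, silently corrupts memory).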
Lifetime
A common (and wrong) belief is that each local variable of a function has its own memory slot allocated on the stack. While this may be true when the program is compiled without optimization, it is usually wrong. The compiler's job is to ensure that the resulting program is as fast as possible. As a consequence, it will try to keep data in CPU registers whenever possible in order to reduce access latency; but since the number of registers is limited, when there are too many variables (or when a pointer to one of them is needed), some must be put on the stack. The compiler analyzes the accesses to each variable and performs a register and stack allocation pass based on the result.
The result of that pass is that two variables that are never used at the same time may share the same register and/or the same location on the stack. A variable may also not always be assigned to the same register or memory location. Thus you must keep in mind that even if the memory on the stack remains valid as long as the frame is active (that is as long as the function didn’t return), a particular memory location can be used by several local variables belonging to different scopes:
void my_broken_function(void)
{
    int b;
    void *ptr;

    {
        int a = 1;

        ptr = &a;
    }
    /* ptr is now invalid */
    b = 2;
    /* This assert may hold: b wasn't "alive" before, so the compiler
     * is allowed to delay its allocation on the stack and to reuse
     * the slot of a "dead" variable such as "a" that is now out of
     * scope. */
    assert (&b == ptr);
    {
        int c = 3;

        /* This assert may hold too. */
        assert (&c == ptr);
    }
}
As a consequence, a pointer to a variable placed on the stack must not exit the lexical scope of that particular variable, and a fortiori it must not be propagated as a return value of the function. The compiler is able to report some broken usages of pointers to stacked variables, but that's only a small part of the possible errors. So you must be very careful when using pointers to variables on the stack, and especially ensure that those pointers never escape their validity scope (that is, they never get returned or stored in a global variable or another variable with a longer lifetime).
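For instance, most compilers diagnose the direct form of escape, `return &local;`, with a warning, but the indirect form below usually compiles silently (a deliberately broken sketch):

```c
#include <stddef.h>

static int *g_ptr;

/* Stores the address of a local variable in a global: the pointer
 * outlives the frame it points into, typically without any
 * compiler diagnostic. */
void leak_stack_address(void)
{
    int local = 42;

    g_ptr = &local;
}

/* After leak_stack_address() returns, g_ptr dangles: dereferencing
 * it is undefined behavior, even if it sometimes appears to "work"
 * because the stack slot has not been reused yet. */
```

Tools such as AddressSanitizer (with stack-use-after-return detection) can catch this class of bug at runtime when the compiler cannot.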
Heap
Owner | The caller of brk() |
---|---|
Lifetime | Until another brk() invalidates the memory |
Initialization | 0 |
The heap is a special memory area that has a fixed location and can be grown. It is managed using the brk() and sbrk() standard calls. Those calls let you add or remove memory at the end of the heap with the granularity of a page. Traditionally these calls were supposed to grow the data segment (that is, to increase the map corresponding to the .bss section). However, this was at a time when memory management was much less advanced and memory was much more constrained. Nowadays, the virtual address space is large and there is no need to keep it packed anymore.
As you can see in the content of /proc/[pid]/maps, on modern operating systems the [heap] is still located near the beginning of the memory address space, after the mapping of the executable (and thus after the .bss section), but there is a large gap between the .bss section and the heap.
As for the stack, the size of the heap is subject to a resource limitation.
In practice, you will never have to manage the heap manually: it is used internally by most malloc() implementations.
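You can still observe the program break moving with sbrk(0) (a glibc-specific sketch; small malloc()s are typically served from the heap):

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>

/* Return how many bytes the program break moved across a small
 * malloc(). The break never moves backward here: either the
 * allocator grows the heap with brk(), or it reuses a free chunk
 * and the break stays put. */
ptrdiff_t break_growth(size_t n)
{
    void *before = sbrk(0);  /* sbrk(0) returns the current break */
    void *p = malloc(n);
    void *after = sbrk(0);

    free(p);
    return (char *)after - (char *)before;
}
```

Whether the break actually moves depends on the allocator's internal state, which is exactly why the heap should be left to malloc() and never managed by hand alongside it.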
mmap()
Owner | The caller |
---|---|
Lifetime | Until the corresponding munmap(s) |
Initialization | File content or 0 |
The mmap system call is the primitive to interact with the virtual memory manager of the kernel. It reserves any of the categories of memory we defined in the first article with the granularity of a page. That granularity does not forbid you to map a file whose size is not a multiple of the page size: the remaining area in the last page will simply be filled with zeros (which won't be written back to disk)7.
When calling mmap you manipulate the properties of an internal lookup structure of the kernel, you don't actually allocate a chunk of memory. As a consequence, running mmap several times on the same page (which is possible using the MAP_FIXED flag) won't create a new map, it will just change the attributes of the pages at the specified address. munmap, the mmap counterpart, removes a particular set of pages from the kernel lookup structure. There is no need to call mmap and munmap in a symmetric fashion: you can unmap several maps with a single large munmap, or unmap a single map with several smaller munmaps.
A small example to illustrate all this:
/** mmap a file of 256MiB and ensure it is aligned on 256MiB.
 */
void *mmap_256(int fd)
{
#define FILE_SIZE (256 << 20)
    /* Create a first map to find a sequence of pages that can
     * contain a sequence of 256MiB properly aligned.
     */
    byte *ptr = mmap(NULL, 2 * FILE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_ANONYMOUS | MAP_SHARED, -1, 0);
    byte *aligned_ptr;

    if (ptr == MAP_FAILED) {
        return ptr;
    }

    /* Find an aligned pointer in the mapped chunk.
     */
    aligned_ptr = (byte *)ROUND_UP((uintptr_t)ptr, FILE_SIZE);

    /* Unmap pages that do not belong to the aligned 256MiB long chunk.
     *
     * This is done by calling munmap twice: once for the pages at the
     * beginning of the map, and once for those at the end of the map.
     */
    if (aligned_ptr != ptr) {
        munmap(ptr, aligned_ptr - ptr);
    }
    munmap(aligned_ptr + FILE_SIZE, FILE_SIZE - (aligned_ptr - ptr));

    /* Map the file in the remaining pages.
     *
     * Provide both an explicit address (aligned_ptr) and the MAP_FIXED
     * flag to tell the kernel that we really want to map those pages.
     * Since these are pages we just mapped, we know they are available.
     */
    return mmap(aligned_ptr, FILE_SIZE, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_FIXED, fd, 0);
}

/** Unmap a file mapped by mmap_256.
 */
void munmap_256(void *ptr)
{
    /* A single munmap suffices. */
    if (munmap(ptr, FILE_SIZE) < 0) {
        assert (false);
    }
}
malloc()
Owner | The caller |
---|---|
Lifetime | Until corresponding free() |
Initialization | None (random content) |
malloc() is probably the most widely known source of memory in C. It is the standard dynamic allocator, with its siblings calloc(), realloc(), memalign(), free(), … It works by requesting big chunks of memory from the kernel and splitting them into smaller ones that satisfy the sizes requested by the caller, while limiting the overhead in both memory and CPU.
The Linux ecosystem is full of different implementations of malloc(). The most notable implementations are probably the ptmalloc family and tcmalloc from Google. The glibc uses the ptmalloc2 implementation, which is itself an adaptation of Doug Lea's malloc (dlmalloc). That version uses a somewhat complex algorithm which varies depending on the size of the requested allocation. It uses two sources of memory: for small allocations it uses the heap, while for big ones it falls back on mmap(). The threshold is dynamically estimated and is by default between 128KiB and 64MiB. This can be changed using the M_MMAP_THRESHOLD and M_MMAP_MAX options of the mallopt() call.
The allocator splits the heap into lists of non-allocated memory chunks. When an allocation is requested, it traverses its lists looking for the most suitable chunk. When no chunk can be found, the allocator grows the heap by issuing a brk() system call and puts the additional memory in its internal lists.
When a chunk of memory is freed (by calling free() on the pointer previously returned by the malloc() call), it is added to a free-chunk list. When the last pages of the heap are free, those pages are released and the heap is reduced (by issuing another brk() call). As a consequence, calling free() in a program may not have a direct impact on the actual memory used by the process. If the program performs a lot of allocations and deallocations with various lifetimes, the heap may become fragmented: it cannot be reduced even if it contains a lot of free memory. Nevertheless, calling free() on a big chunk of memory (big enough to have been allocated through the fallback on mmap()) will call munmap and thus will have a direct impact on the memory consumption of your process.
ptmalloc2 has a memory overhead of at least 8 bytes per allocation on 64-bit systems. These bytes are used internally to store some meta-data about the allocation. This makes it extremely memory-inefficient for very small allocations (but this is the case for most malloc() implementations). Those administrative bytes are stored just before the returned pointer, which means they can easily be overwritten inadvertently by a memory underflow. The consequence of such an overwrite is a corruption of the allocator that will later lead to a segmentation fault (later meaning that the crash will seem totally random at first sight).
Nowadays, a challenge for a good malloc() implementation is to behave properly in a multithreaded environment. Most of the time, this is achieved by having thread-local caches and a shared heap: the allocator first tries to find a matching chunk in the local cache and falls back to the heap (which requires locking) only if no matching element can be found.
Next: Intersec’s custom allocators
After spending three full (and quite long) articles laying the foundations of memory understanding, in the next post we will cover some of the specific allocators we chose to develop at Intersec to match our own requirements.
- the size of each cache can be found in the CPU specification; the access times have been computed using a small program that randomly traverses memory ranges of various sizes. Gaps clearly appear in the resulting data-set when the size of the memory range crosses the size of a cache. [↩]
- You should also avoid mixing in the same cache line variables that are mostly read-only with variables that are mostly written in order to avoid cache contention in concurrent programming. [↩]
- If it is really small and fits in the mapping of the .data section, it just won't have a dedicated mapping. [↩]
- since the map is private, the zeros written there are not visible to other processes [↩]
- The frames are stacked from the end to the beginning of the memory region. [↩]
- Coding conventions will often forbid alloca() because it is both dangerous and non-portable. [↩]
- Note that if for some reason you reduce the size of an already mmaped file (using truncate()) without reducing the size of the map accordingly, accessing the part of the mapping that has no underlying storage anymore will cause a SIGBUS. This, as far as we can tell, is the most common situation that can lead to a SIGBUS on Linux. [↩]