At Intersec we chose the C programming language because it gives us full control over what we’re doing and achieves a high level of performance. For many people, performance is just about using as few CPU instructions as possible. However, on modern hardware it’s much more complicated than just the CPU. Algorithms have to deal with memory, CPU, disk and network I/O… Each of them adds to the cost of the algorithm, and each of them must be properly understood in order to guarantee both the performance and the reliability of the algorithm.
The impact of the CPU (and, as a consequence, of algorithmic complexity) on performance is well understood, as are disk and network latencies. Memory, however, seems much less understood. As our experience with our customers shows, even the output of widely used tools such as top is cryptic to most system administrators.
This post is the first in a series of five about memory. We will deal with topics such as the definition of memory, how it is managed, how to read the output of tools… This series addresses subjects that will be of interest to both developers and system administrators. While most rules should apply to most modern operating systems, we’ll talk more specifically about Linux and the C programming language.
We’re not the first ones to write about memory. In particular, we’d like to highlight the high-quality paper by Ulrich Drepper: What every programmer should know about memory.
This first post provides a definition of memory. It assumes at least a basic knowledge of notions such as an address or a process. It also often touches on subjects such as system calls and the difference between user-land and kernel mode. All you need to know, however, is that your process (user-land) runs on top of the kernel, which itself talks to the hardware, and that system calls let your process talk to the kernel in order to request more resources. You can get details about the system calls by reading their respective manual pages.
On modern operating systems, each process lives in its own memory address space. Instead of mapping memory addresses directly to hardware addresses, the operating system serves as a hardware abstraction layer and creates a virtual memory space for each process. The mapping between virtual addresses and physical memory addresses is done by the CPU using a per-process translation table maintained by the kernel (each time the kernel changes the running process on a specific CPU core, it changes the translation table of that core).
Virtual memory has several purposes. First, it allows process isolation. A process in user-land can only express memory accesses as addresses in its virtual memory. As a consequence, it can only access data that has previously been mapped in its own virtual space and thus cannot access the memory of other processes (unless explicitly shared).
The second purpose is the abstraction of the hardware. The kernel is free to change the physical address to which a virtual address is mapped. It can also choose not to provide any physical memory for a specific virtual address until it is actually needed. Moreover, it can swap the memory out to disk when it has not been used for a long time and the system is getting short of physical memory. Overall, this gives the kernel a lot of freedom: its only constraint is that when the program reads the memory, it actually finds what it previously wrote there.
The third purpose is the possibility to give addresses to things that are not actually in RAM. This is the principle behind mmap and file mapping. You can give a virtual memory address to a file so that it can be accessed as if it were a memory buffer. This is a very useful abstraction that helps keep the code quite simple and, since on 64-bit systems you have a huge virtual space [1], you can even map your whole hard drive into virtual memory if you want.
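To make the file-mapping idea concrete, here is a minimal sketch. The helper name count_lines is ours (hypothetical), and we assume a POSIX system. It maps a file read-only and scans it exactly like an in-memory buffer, without a single read() call.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count the '\n' bytes of a file by mapping it and
 * scanning it as if it were an ordinary memory buffer.
 * Returns the number of newlines, or -1 on error. */
long count_lines(const char *path)
{
    struct stat st;
    const char *data;
    long lines = 0;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {
        /* mmap() refuses zero-length mappings. */
        close(fd);
        return 0;
    }

    data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid once the descriptor is closed */
    if (data == MAP_FAILED) {
        return -1;
    }

    /* From here on, the file is just memory. */
    for (size_t i = 0; i < (size_t)st.st_size; i++) {
        if (data[i] == '\n') {
            lines++;
        }
    }

    munmap((void *)data, (size_t)st.st_size);
    return lines;
}
```

Calling count_lines() on a text file returns its number of newline characters. The kernel loads the pages from disk as the loop reaches them.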
The fourth purpose is sharing. Since the kernel knows what is mapped in the virtual space of each running process, it can avoid loading things twice in memory and make the virtual addresses of processes that use the same resource point to the same physical memory (even if the actual virtual address is specific to each process). A consequence of sharing is the use of copy-on-write (COW) by the kernel: when two processes use the same data but one of them modifies it while the other one is not allowed to see the change, the kernel makes the copy only when the data gets modified. More recently, operating systems have also gained the ability to detect identical memory in several address spaces and automatically map it to the same physical memory (marking it as subject to COW) [2]. On Linux, this is called KSM (Kernel SamePage Merging).
The best-known use case of COW is fork(). On Unix-like systems, fork() is the system call that creates a process by duplicating the current one. When fork() returns, both processes continue at exactly the same point, with the same open files and the same memory. Thanks to COW, fork() does not duplicate the memory of the process: only the data modified by either the parent or the child gets duplicated in RAM. Since most uses of fork() are immediately followed by a call to exec() that invalidates the whole virtual address space, the COW mechanism avoids a useless full copy of the memory of the parent process.
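The visible effect of COW after a fork() can be sketched as follows. The function name value_after_child_write and the global value are ours (hypothetical). The child writes to a variable it inherited, and the parent then checks that its own copy was left untouched.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int value = 1;

/* Hypothetical demo: returns the parent's view of `value` after a forked
 * child has modified its own copy, or -1 on error. */
int value_after_child_write(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* Child: this write makes the kernel copy the page holding `value`;
         * the parent's page is left untouched. */
        value = 2;
        assert(value == 2); /* the child does see its own write */
        _exit(0);
    }

    /* Parent: wait for the child, then look at our own copy. */
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        return -1;
    }
    return value;
}
```

The parent still reads 1. Only the page the child wrote to was duplicated, and everything else remains shared between the two processes.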
Another side effect is that fork() creates a snapshot of the (private) memory of a process at little cost. If you want to perform some operation on the memory of a process without the risk of it being modified under your feet, and don’t want to add a costly and error-prone locking mechanism, just fork, do your work and communicate the result of your computation back to the parent process (by return code, file, shared memory, pipe, …).
This works extremely well as long as your computation is fast enough for a large part of the memory to remain shared between the parent and the child processes. It also helps keep your code simple: the complexity is hidden in the virtual-memory code of the kernel, not in yours.
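A minimal sketch of this snapshot pattern, assuming the result fits in a single long and is sent back through a pipe. The name snapshot_sum is ours (hypothetical), and a real computation would obviously be heavier than a sum.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: sum `len` integers in a forked child that works on a
 * snapshot of the parent's memory, and send the result back through a pipe.
 * Returns 0 on success and stores the sum in *result, -1 on error. */
int snapshot_sum(const int *data, size_t len, long *result)
{
    int fds[2];
    int status;
    ssize_t n;
    pid_t pid;

    assert(data != NULL && result != NULL);

    if (pipe(fds) < 0) {
        return -1;
    }
    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: `data` is frozen as it was at fork() time, whatever the
         * parent does to its own copy from now on. */
        long sum = 0;

        close(fds[0]);
        for (size_t i = 0; i < len; i++) {
            sum += data[i];
        }
        n = write(fds[1], &sum, sizeof(sum));
        _exit(n == (ssize_t)sizeof(sum) ? 0 : 1);
    }

    /* Parent: free to keep modifying its data; here it just collects the
     * result computed on the snapshot. */
    close(fds[1]);
    n = read(fds[0], result, sizeof(*result));
    close(fds[0]);
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    if (n != (ssize_t)sizeof(*result) || !WIFEXITED(status)
        || WEXITSTATUS(status) != 0) {
        return -1;
    }
    return 0;
}
```

No lock is involved. The isolation between the computation and the live data comes entirely from the copy-on-write behavior of fork().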
Virtual memory is divided into pages. The size of a page is imposed by the CPU and is usually 4 KiB [3]. This means that memory management in the kernel is done with the granularity of a page. When you request new memory, the kernel gives you one or more pages; when you release memory, you release one or more pages… Every finer-grained API (e.g. malloc) is implemented in user-land.
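Page granularity is easy to observe from a program. The sketch below queries the page size with sysconf() and rounds a size up to the number of bytes the kernel would actually hand out. The helper name round_up_to_pages is ours (hypothetical).

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: round `size` up to a whole number of pages, which is
 * the smallest amount of memory the kernel can actually provide for it. */
size_t round_up_to_pages(size_t size)
{
    long page = sysconf(_SC_PAGESIZE); /* usually 4096 on x86 Linux */

    assert(page > 0);
    return (size + (size_t)page - 1) / (size_t)page * (size_t)page;
}
```

With 4 KiB pages, a request for a single byte costs a full 4096-byte page at the kernel level. This is why allocators such as malloc carve pages into smaller blocks in user-land.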
For each allocated page, the kernel keeps a set of permissions: the page can be readable, writable and/or executable (note that not all combinations are possible). These permissions are set either while mapping the memory or afterward using the mprotect() system call. Pages that have not been allocated yet are not accessible. When you try to perform a forbidden action on a page (for example, reading data from a page without the read permission), you trigger (on Linux) a segmentation fault. As a side note, since the segmentation fault has the granularity of a page, you may perform out-of-buffer accesses that don’t lead to a segfault.
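Here is a small sketch of page permissions in action, assuming Linux. The function name write_to_readonly_page_segfaults is ours (hypothetical). It maps one page, makes it read-only with mprotect(), then lets a forked child attempt a write so that the resulting segmentation fault kills the child rather than the caller.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if writing to a read-only page killed the
 * child with SIGSEGV, 0 if it did not, -1 on error. */
int write_to_readonly_page_segfaults(void)
{
    long page = sysconf(_SC_PAGESIZE);
    int status;
    pid_t pid;
    char *p;

    assert(page > 0);

    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return -1;
    }
    p[0] = 'x'; /* allowed: the page is still writable */

    if (mprotect(p, (size_t)page, PROT_READ) < 0) {
        munmap(p, (size_t)page);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        munmap(p, (size_t)page);
        return -1;
    }
    if (pid == 0) {
        *(volatile char *)p = 'y'; /* forbidden: the page is now read-only */
        _exit(0);                  /* not reached if the write faults */
    }

    if (waitpid(pid, &status, 0) < 0) {
        munmap(p, (size_t)page);
        return -1;
    }
    munmap(p, (size_t)page);
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```

The protection applies to the whole page. A write anywhere within those bytes faults, while an overflow that stays inside a writable page goes unnoticed.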
Not all memory allocated in the virtual memory space is the same. We can classify it along two axes: the first axis is whether the memory is private (specific to that process) or shared, the second is whether the memory is file-backed or not (in which case it is said to be anonymous). This creates a classification with four memory classes: private anonymous, shared anonymous, private file-backed and shared file-backed.
Private memory is, as its name says, memory that is specific to the process. Most of the memory you deal with in a program is actually private memory.
Since changes made in private memory are not visible to other processes, it is subject to copy-on-write. As a side effect, this means that even if the memory is private, several processes might share the same physical memory to store the data. In particular, this is the case for binary files and shared libraries. A common misconception is that KDE takes a lot of RAM because every single process loads Qt and the KDElibs. However, thanks to the COW mechanism, all the processes use the exact same physical memory for the read-only parts of those libraries.
In the case of file-backed private memory, the changes made by the process are not written back to the underlying file. However, changes made to the file may or may not be made available to the process.
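The first half of that statement can be checked with a short sketch. The function name private_write_reaches_file is ours (hypothetical), and we assume the file is not empty. It modifies the first byte through a private mapping, then reads the file again through the descriptor.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical demo: modify the first byte of a file through a private
 * mapping, then read the file again through the descriptor.
 * Returns 1 if the file content changed, 0 if it did not, -1 on error
 * (including an empty file). */
int private_write_reaches_file(const char *path)
{
    struct stat st;
    char on_disk;
    char *map;
    int changed;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    /* A private mapping may be writable even though the file itself is
     * opened read-only: the writes never go back to the file. */
    map = mmap(NULL, 1, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    map[0] = (char)(map[0] ^ 0x01); /* COW: only our private copy changes */

    /* The descriptor offset is still 0, so this reads the first byte. */
    if (read(fd, &on_disk, 1) != 1) {
        munmap(map, 1);
        close(fd);
        return -1;
    }
    changed = (on_disk == map[0]);

    munmap(map, 1);
    close(fd);
    return changed;
}
```

The function returns 0: the write stayed in the private copy of the page and the file is unchanged.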
Shared memory is designed for inter-process communication. It can only be created by explicitly requesting it, using the right mmap() call or a dedicated call (shm*). When a process writes to shared memory, the modification is seen by all the processes that map the same memory.
When the memory is file-backed, any process mapping the file sees the changes, since those changes are propagated through the file itself.
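For contrast with private memory, here is a minimal sketch of shared anonymous memory, assuming Linux. The function name value_shared_with_child is ours (hypothetical). The mapping is created with MAP_SHARED before fork(), so the child's write lands in the very page the parent reads.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns the value the parent reads from a shared
 * mapping after a forked child wrote 42 into it, or -1 on error. */
int value_shared_with_child(void)
{
    int status;
    int seen;
    pid_t pid;
    int *shared = mmap(NULL, sizeof(*shared), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (shared == MAP_FAILED) {
        return -1;
    }
    *shared = 1;

    pid = fork();
    if (pid < 0) {
        munmap(shared, sizeof(*shared));
        return -1;
    }
    if (pid == 0) {
        assert(*shared == 1); /* the child inherits the mapping as is */
        *shared = 42;         /* no copy-on-write: this is the same page */
        _exit(0);
    }

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status)
        || WEXITSTATUS(status) != 0) {
        munmap(shared, sizeof(*shared));
        return -1;
    }
    seen = *shared;
    munmap(shared, sizeof(*shared));
    return seen;
}
```

The parent reads 42. Had the mapping been created with MAP_PRIVATE instead, it would still read 1.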
Anonymous memory is purely in RAM. However, the kernel does not actually map that memory to a physical address before it actually gets written. As a consequence, anonymous memory does not add any pressure on the kernel before it is actually used. This allows a process to “reserve” a lot of memory in its virtual address space without really using RAM. As a consequence, the kernel lets you reserve more memory than is actually available. This behavior is often referred to as over-commit (or memory overcommitment).
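This lazy behavior can be observed with mincore(), which reports which pages of a mapping are currently in RAM. The sketch below assumes Linux, and the names resident_pages and reserve_then_touch are ours (hypothetical). It reserves a region, counts its resident pages, writes a single byte, and counts again.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS and mincore() on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the pages of [addr, addr + len) that mincore() reports as being in
 * RAM. Returns -1 on error. */
static long resident_pages(void *addr, size_t len, size_t page)
{
    size_t count = (len + page - 1) / page;
    unsigned char *vec = malloc(count);
    long resident = 0;

    if (vec == NULL) {
        return -1;
    }
    if (mincore(addr, len, vec) < 0) {
        free(vec);
        return -1;
    }
    for (size_t i = 0; i < count; i++) {
        resident += vec[i] & 1; /* lowest bit set: the page is resident */
    }
    free(vec);
    return resident;
}

/* Hypothetical demo: reserve `len` bytes of anonymous memory, write to the
 * first byte only, and report the resident page count before and after.
 * Returns 0 on success, -1 on error. */
int reserve_then_touch(size_t len, long *before, long *after)
{
    long page = sysconf(_SC_PAGESIZE);
    char *mem;

    assert(len > 0 && page > 0);

    mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        return -1;
    }

    *before = resident_pages(mem, len, (size_t)page);
    mem[0] = 1; /* first write: the kernel now has to provide real memory */
    *after = resident_pages(mem, len, (size_t)page);

    munmap(mem, len);
    return (*before < 0 || *after < 0) ? -1 : 0;
}
```

On a typical Linux system we would expect no resident page before the write and only a handful after it, however large the reservation. The exact numbers depend on the kernel configuration, for instance on transparent huge pages.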
File-backed and Swap
When a memory map is file-backed, the data is loaded from the disk. Most of the time it is loaded on demand. However, you can give hints to the kernel so that it can prefetch memory ahead of reads. This helps keep your program snappy when you know you have a particular access pattern (mostly sequential accesses). In order to avoid using too much RAM, you can also tell the kernel that you no longer care about having the pages in RAM, without unmapping the memory. All this is done using the madvise() system call.
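A possible use of those hints is sketched below, assuming Linux and a non-empty file. The function name sum_bytes_with_hints is ours (hypothetical). The madvise() calls are purely advisory, so their return values are deliberately ignored.

```c
#define _DEFAULT_SOURCE /* for madvise() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: sum the bytes of a file through a mapping, telling
 * the kernel about our access pattern along the way.
 * Returns the sum, or -1 on error (including an empty file). */
long sum_bytes_with_hints(const char *path)
{
    struct stat st;
    unsigned char *data;
    unsigned char first;
    size_t len;
    long sum = 0;
    int fd;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    len = (size_t)st.st_size;

    data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (data == MAP_FAILED) {
        return -1;
    }

    /* Hints only: we read front to back and will need the pages soon. */
    (void)madvise(data, len, MADV_SEQUENTIAL);
    (void)madvise(data, len, MADV_WILLNEED);

    first = data[0];
    for (size_t i = 0; i < len; i++) {
        sum += data[i];
    }

    /* We are done with the pages: the kernel may drop them from RAM. The
     * mapping stays valid, a later access just reloads from the file. */
    (void)madvise(data, len, MADV_DONTNEED);
    assert(data[0] == first);

    munmap(data, len);
    return sum;
}
```

Whether these hints actually pay off depends on the workload and on the storage, so they are worth measuring before being relied upon.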
When the system runs short of physical memory, the kernel tries to move some data from RAM to the disk. If the memory is file-backed and shared, this is quite easy. Since the file is the source of the data, the data is simply removed from RAM; the next time it is read, it is loaded from the file again.
The kernel can also choose to remove anonymous/private memory from RAM, in which case that data is written to a specific place on disk. It is said to be swapped out. On Linux, the swap is usually stored in a dedicated partition; on other systems it can be stored in specific files. Then it works the same way as for file-backed memory: when the data gets accessed, it is read from the disk and reloaded in RAM.
Thanks to the use of a virtual addressing space, swapping pages in and out is totally transparent for the process… what is not, though, is the latency induced by the disk I/O.
Next: Resident Memory and Tools
We’ve covered here some important notions about memory. While we talked a few times about physical memory and the difference with reserved address spaces, we avoided dealing with the actual memory pressure of the process. We will address that topic and describe some tools that let you understand the memory consumption of a process in the next article.
[1] Nowadays, consumer CPUs have 48 bits of addressing, which means 2^48 bytes are addressable, that is, 256 TiB.
[2] This is useful in case you have a large amount of memory present several times, for example when you run a virtualization server and have several virtual machines running the same OS.
[3] On Intel CPUs, there is an alternative page size: 2 MiB. However, on Linux the number of 2 MiB pages available is limited and using them requires mapping a pseudo-file, which makes them awkward to use.