At some point in every engineer’s life, someone runs free -m on a healthy server, sees 2% free, and opens an incident. The number is true and the panic is wrong, and the distance between those two facts is most of what’s interesting about Linux memory management. The kernel is running a ruthless little economy behind your back: lending idle RAM to the file cache, deferring every promise it can, and when genuinely cornered, selecting a process to kill. None of it is visible in free unless you know which columns to believe.

This is part four of the series. The CPU post kept blaming memory for stalls; this one is where those stalls come from, and what happens when memory itself becomes the contested resource.

malloc Is a Promise, Not a Payment

Start with the fact that explains half the weird numbers you’ll ever see: allocating memory does almost nothing. When a process calls malloc(), the allocator extends the virtual address space and returns. No physical page is touched. Only when the process actually stores to that memory does the MMU discover there’s no mapping, raise a page fault, and trap into the kernel, which only then finds a physical page and wires it up. This is demand paging, and it’s why a process’s virtual size (VSZ) is fiction and its resident set size (RSS) is the number that matters. A JVM that “allocated” 30 GB may be touching 4.
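To watch the promise and the payment happen separately, here is a minimal sketch, assuming Linux and Python 3. It maps anonymous memory, then writes one byte to each page, printing VmSize and VmRSS from /proc/self/status at each step. The helper names (status_kb, snapshot, demo) and the 64 MiB size are illustrative choices, not anything from the kernel or a library.

```python
import mmap
import re


def status_kb(status_text, field):
    """Return a kB-valued field (e.g. VmSize, VmRSS) from /proc/<pid>/status text."""
    match = re.search(rf"^{re.escape(field)}:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise KeyError(field)
    return int(match.group(1))


def snapshot():
    """Read (VmSize, VmRSS) in kB for the current process. Linux only."""
    with open("/proc/self/status") as f:
        text = f.read()
    return status_kb(text, "VmSize"), status_kb(text, "VmRSS")


def demo(size_mib=64):
    size = size_mib * 1024 * 1024
    print("start:   VmSize=%d kB  VmRSS=%d kB" % snapshot())
    buf = mmap.mmap(-1, size)            # anonymous mapping: address space only
    print("mapped:  VmSize=%d kB  VmRSS=%d kB" % snapshot())
    for offset in range(0, size, mmap.PAGESIZE):
        buf[offset] = 1                  # first store to a page forces a fault
    print("touched: VmSize=%d kB  VmRSS=%d kB" % snapshot())
    buf.close()
```

Calling demo() on a Linux box should show VmSize jump at the "mapped" step while VmRSS only catches up after the "touched" step. Exact numbers depend on the interpreter and on transparent huge page settings.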

Faults come in two severities, and the distinction matters for diagnosis. A minor fault is satisfied from memory already at hand: a new page from the free list, or mapping an already-resident shared library page. Cheap, normal, the sound of a process growing. A major fault requires storage I/O: either a file-backed page that isn’t in memory yet or, worse, a page coming back from swap. Major faults are milliseconds in a nanosecond world; a process with a high major fault rate is being quietly tortured, and ps will happily show you the damage (the MAJFLT column) while %CPU sits nearly idle.
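The same counters ps reports can be read straight from /proc/<pid>/stat, where proc(5) documents minflt as field 10 and majflt as field 12. A small sketch, assuming Linux; the function names are hypothetical:

```python
def fault_counts(stat_line):
    """Extract cumulative minor/major fault counts from /proc/<pid>/stat text."""
    # Field 2 (comm) is parenthesised and may itself contain spaces or ')',
    # so split on the last ')' before tokenising the remaining fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 10 is rest[7] and field 12 is rest[9].
    return {"minflt": int(rest[7]), "majflt": int(rest[9])}


def read_fault_counts(pid="self"):
    """Read the counters for a live process. Linux only."""
    with open(f"/proc/{pid}/stat") as f:
        return fault_counts(f.read())
```

These are totals since process start, so the useful number is the difference between two samples. A majflt delta that keeps climbing while the process shows little CPU time is the signature described above.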

Linux compounds the promise-making with overcommit: in its default heuristic mode (vm.overcommit_memory=0) it refuses only obviously absurd requests and will otherwise grant more virtual memory than it could ever back, on the actuarial bet that processes never touch everything they ask for. Mostly the bet pays. When it doesn’t, settlement arrives via the OOM killer. We’ll get there.
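For contrast, strict accounting (vm.overcommit_memory=2) replaces the bet with a hard ceiling. The kernel documentation gives that ceiling as (RAM minus huge pages) times vm.overcommit_ratio percent, plus swap, and reports it as CommitLimit in /proc/meminfo. A sketch of that arithmetic, with hypothetical function names and assuming vm.overcommit_kbytes is not set:

```python
OVERCOMMIT_MODES = {
    0: "heuristic (default)",
    1: "always overcommit",
    2: "strict accounting",
}


def describe_mode(sysctl_value):
    """Name the mode read from /proc/sys/vm/overcommit_memory."""
    return OVERCOMMIT_MODES[int(sysctl_value.strip())]


def commit_limit_kb(mem_total_kb, swap_total_kb, overcommit_ratio=50, hugetlb_kb=0):
    """Commit ceiling in kB. Only enforced when the mode is 2."""
    return (mem_total_kb - hugetlb_kb) * overcommit_ratio // 100 + swap_total_kb
```

With the default ratio of 50 and no swap, a strict-mode box would refuse allocations once committed memory reached half of RAM, which is one reason most systems stay on the heuristic. Comparing Committed_AS against MemTotal in /proc/meminfo shows how far the promises currently exceed the physical memory.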

Where All the Memory Went

Back to the panicked free -m. The memory isn’t missing; it’s working:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          64231       12608         801         300       50821       50613

Almost everything not used by processes is buff/cache, which is overwhelmingly the page cache: file contents kept in RAM after reads and before writes hit disk. This is the kernel’s single best performance feature. Recall the scaled latencies from post one: DRAM is six minutes, disk is months. Every file read served from page cache is a months-to-minutes conversion, and the kernel will convert as much idle RAM into that win as it can. Free memory serves nobody. The cache can be reclaimed nearly instantly when processes need the pages, which is why the column to read is available (free plus everything cheaply reclaimable), and why this box has 50 GB of headroom, not 800 MB.
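The two numbers free prints come from /proc/meminfo (MemFree and MemAvailable), so the "which column to believe" rule is easy to encode. A minimal sketch, assuming a kernel new enough to export MemAvailable (3.14 or later); the function names are illustrative:

```python
def meminfo_kb(text):
    """Parse /proc/meminfo text into {field: integer value}."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def headroom(text):
    """Contrast the scary number (free) with the honest one (available)."""
    m = meminfo_kb(text)
    return {
        "free_mib": m["MemFree"] // 1024,
        "available_mib": m["MemAvailable"] // 1024,
        "available_pct": round(100 * m["MemAvailable"] / m["MemTotal"], 1),
    }


def read_headroom():
    """Linux only: read the live values."""
    with open("/proc/meminfo") as f:
        return headroom(f.read())
```

Fed the values behind the free -m output above, headroom() reports 801 MiB free but 50613 MiB available. An alert keyed on available_pct would stay quiet on this box, where one keyed on free would page someone.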

The flip side: a “memory-bound” symptom is sometimes a cache problem in disguise. Shrink the page cache by growing process memory and your disk I/O goes up, your file system latency goes up, and the incident gets filed against the database. We’ll meet this again in the file systems post.

Reclaim: The Part You Pay Latency For

When free pages run low, the kernel starts taking them back, and the mechanism has a fast lane and a painful one.

The background path is kswapd, the page-out daemon. Woken when free memory dips below a threshold, it walks LRU lists of inactive pages, evicting clean page cache (free to drop; the file is still on disk), writing back dirty pages, and paging out cold anonymous memory if swap exists. Done well, applications never notice; kswapd skims while they sleep.

The painful path is direct reclaim. If memory pressure outruns kswapd, allocations start performing reclaim synchronously. Your application thread, mid-allocation, stops and goes hunting for pages to free, possibly waiting on disk writeback. This is memory pressure converting directly into application latency, and it’s invisible in utilization metrics: the thread isn’t running, it isn’t blocked on its own I/O, it’s doing the kernel’s janitorial work. Two good flashlights here: the kswapd scan counters (pgscan_kswapd in /proc/vmstat, or pgscank/s in sar -B) trend up before trouble, and BCC’s drsnoop traces direct reclaim events per process with the latency each one ate. Kernels from 4.20 on also state it plainly in /proc/pressure/memory (PSI): the share of time tasks spent stalled on memory, which is the USE method’s saturation metric handed to you on a plate.
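The pressure file is two short lines, "some" (at least one task stalled) and "full" (all non-idle tasks stalled), each carrying 10-, 60- and 300-second averages as percentages plus a cumulative total in microseconds. A sketch of a parser, assuming that documented PSI layout; the function names are hypothetical:

```python
def parse_pressure(text):
    """Parse /proc/pressure/memory text into {'some': {...}, 'full': {...}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *pairs = line.split()
        # Each pair looks like avg10=0.00 or total=12345.
        result[kind] = {key: float(value)
                        for key, value in (pair.split("=", 1) for pair in pairs)}
    return result


def read_memory_pressure():
    """Linux with PSI enabled only."""
    with open("/proc/pressure/memory") as f:
        return parse_pressure(f.read())
```

A nonzero "some" avg10 means threads lost wall-clock time to reclaim in the last ten seconds, regardless of what the utilization graphs say. What value deserves an alert is workload-specific, so the sketch leaves thresholds out.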

Which pages get sacrificed first, file cache or process memory, is the swappiness tunable (default 60; the range is 0–100, extended to 0–200 in kernel 5.8; higher favors paging out application memory to preserve warm file cache). That default surprises people: the kernel will genuinely choose to page out your cold heap to keep hot file data resident, and for throughput it’s often right.

Swap, No Swap, and the Killer

Note what “swapping” means on Linux: paging out individual cold pages, not the ancient Unix move of evicting whole processes. Page-level swap under moderate pressure is survivable; it’s the grace period in which an application with a leak gets slow before it gets dead, and slow is debuggable. You can catch it live (nonzero si/so columns in vmstat are the tell) and go find the growth.

Run without swap, which is increasingly the norm, and there’s no grace period. The leak grows until allocations can’t be satisfied, and the kernel invokes the OOM killer, which scores processes (select_bad_process(), a function name with no chill) and kills the loser, logging Out of memory: Kill process (Killed process on newer kernels) to the kernel log. This is why dmesg is step two of the 60-second checklist: an OOM kill is the answer to a surprising number of “the service just vanished” mysteries, sitting in plain sight where nobody looked.
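Pulling the victims out of a saved kernel log takes one regular expression. The sketch below assumes the two message shapes mentioned above (Kill process on older kernels, Killed process on newer ones) and matches case-insensitively so that the cgroup variant, which begins "Memory cgroup out of memory", is caught too. The function name is hypothetical, and the log lines in the usage note are made-up illustrations, not captured output.

```python
import re

# Matches "Out of memory: Kill process 123 (name)" and
# "... out of memory: Killed process 123 (name)".
OOM_RE = re.compile(r"out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)",
                    re.IGNORECASE)


def oom_victims(log_text):
    """Return [(pid, command), ...] for every OOM kill found in kernel log text."""
    return [(int(pid), name) for pid, name in OOM_RE.findall(log_text)]
```

For example, oom_victims("Out of memory: Killed process 977 (myapp) total-vm:1024kB") returns [(977, 'myapp')]. Feeding it the output of dmesg or journalctl -k turns "has the killer been here?" into a list you can diff against the services that vanished.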

Whether no-swap is the right call is a real trade-off, not a hygiene rule. Netflix runs swapless on purpose: with a large pool of instances behind a load balancer, a fast OOM kill plus traffic redirection beats one instance slowly drowning in page-outs and dragging latency for everyone routed to it. Fail fast and loud, or degrade slow and debuggable. Pick one based on your architecture, not folklore.

One modern wrinkle that catches everyone: cgroup limits. In a container world, the memory accounting that matters is the cgroup’s, and a host with 50 GB available will still OOM-kill a container that hit its own limit. If a process died of OOM on a host that looks empty, check the container limits before doubting the kernel.
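Checking the container's own ledger is a matter of three small files. The sketch below assumes cgroup v2, where the memory controller exposes memory.current (bytes in use), memory.max (the limit, or the literal string max when unlimited) and memory.events (counters including oom_kill). The /sys/fs/cgroup base path is an assumption that holds inside many containers but not all, and the function names are hypothetical.

```python
def parse_limit(text):
    """memory.max holds a byte count, or 'max' when no limit is set."""
    value = text.strip()
    return None if value == "max" else int(value)


def parse_events(text):
    """memory.events is one 'name count' pair per line."""
    return {key: int(count)
            for key, count in (line.split() for line in text.splitlines() if line.strip())}


def cgroup_summary(current_text, max_text, events_text):
    limit = parse_limit(max_text)
    current = int(current_text.strip())
    return {
        "current_bytes": current,
        "limit_bytes": limit,
        "used_fraction": None if limit is None else current / limit,
        "oom_kills": parse_events(events_text).get("oom_kill", 0),
    }


def read_cgroup_summary(base="/sys/fs/cgroup"):
    """Assumed path; adjust for the cgroup you are inspecting."""
    def read(name):
        with open(f"{base}/{name}") as f:
            return f.read()
    return cgroup_summary(read("memory.current"), read("memory.max"), read("memory.events"))
```

A nonzero oom_kills with a used_fraction near 1.0 settles the argument: the container ran out of its own allowance, however empty the host looked.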

Watching It Happen

The working set for memory triage, in roughly the order I reach for them:

free -m                        # believe 'available', not 'free'
dmesg -T | grep -i 'out of'    # has the killer already been here?
vmstat -SM 1                   # si/so nonzero = actively paging; watch 'free' trend
cat /proc/pressure/memory      # stall time: the honest saturation number
ps -eo pid,rss,maj_flt,comm --sort=-rss | head   # who's resident, who's faulting
drsnoop                        # direct reclaim: who paid, how long
memleak -p "$(pgrep -of myapp)"  # outstanding allocations by stack; -o picks one (oldest) PID

memleak deserves a sentence: it instruments the allocation and free paths and reports allocations that were never freed, with the stack that made them, which converts “RSS grows 200 MB/hour” from a graph into a function name. On a long-running process, that’s the difference between a restart cron job and a fix.

Why This Matters

The kernel’s memory behavior looks adversarial right up until you learn its incentives. Then it looks like a colleague with strong opinions about caching. Free memory is inventory doing nothing, so it lends it out. Promises are cheaper than pages, so it overcommits. When the bill comes due, it pays with your latency (reclaim) or someone’s life (OOM), and both events are observable in advance if you watch pressure instead of utilization.

Next post: the file systems those cached pages belong to. VFS, why open() latency is a thing, and measuring where file I/O time actually goes before you blame the disks underneath.

References

  1. Gregg, B. (2020). Systems Performance: Enterprise and the Cloud, 2nd Edition. Addison-Wesley. Chapter 7.
  2. Gregg, B. (2019). BPF Performance Tools. Addison-Wesley. Chapter 7.
  3. Corbet, J. (2004). Kswapd and high-order allocations. LWN.net.