Free Memory Is Memory Doing Nothing: Page Cache, Reclaim, and the OOM Killer
At some point in every engineer’s life, someone runs free -m on a healthy
server, sees 2% free, and opens an incident. The number is true and the
panic is wrong, and the distance between those two facts is most of what’s
interesting about Linux memory management. The kernel is running a
ruthless little economy behind your back: lending idle RAM to the file
cache, deferring every promise it can, and when genuinely cornered,
selecting a process to kill. None of it is visible in free unless you
know which columns to believe.
This is part four of the series. The CPU post kept blaming memory for stalls; this one is where those stalls come from, and what happens when memory itself becomes the contested resource.
malloc Is a Promise, Not a Payment
Start with the fact that explains half the weird numbers you’ll ever see:
allocating memory does almost nothing. When a process calls malloc(), the
allocator extends the virtual address space and returns. No physical page
is touched. Only when the process actually stores to that memory does the
MMU discover there’s no mapping, raise a page fault, and trap into the
kernel, which only then finds a physical page and wires it up. This is
demand paging, and it’s why a process’s virtual size (VSZ) is fiction and
its resident set size (RSS) is the number that matters. A JVM that
“allocated” 30 GB may be touching 4.
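The gap between promise and payment is easy to watch. Here is a minimal sketch, assuming Linux (it reads /proc/self/statm) and Python 3.9 or later. It creates a private anonymous mapping, which is roughly what a large malloc() does underneath, and checks RSS before and after storing to each page. The 64 MiB size and the helper names are illustrative.

```python
import mmap
import os

PAGE = os.sysconf("SC_PAGE_SIZE")

def rss_bytes(statm_text: str, page_size: int = PAGE) -> int:
    """RSS from the text of /proc/<pid>/statm (second field is resident pages)."""
    return int(statm_text.split()[1]) * page_size

def current_rss() -> int:
    with open("/proc/self/statm") as f:
        return rss_bytes(f.read())

def demo(size: int = 64 * 1024 * 1024) -> tuple[int, int]:
    """Return RSS growth in bytes after mapping, then after touching, `size` bytes."""
    base = current_rss()
    # Private anonymous mapping: address space is reserved, no pages are backed yet.
    region = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    after_map = current_rss() - base
    for offset in range(0, size, PAGE):
        region[offset] = 1  # one store per page triggers one page fault
    after_touch = current_rss() - base
    region.close()
    return after_map, after_touch

if __name__ == "__main__":
    mapped, touched = demo()
    print(f"RSS growth after mapping:  {mapped / 2**20:.1f} MiB")
    print(f"RSS growth after touching: {touched / 2**20:.1f} MiB")
```

On a typical box the first number should sit near zero and the second near the mapping size, though the exact figures depend on the allocator, transparent huge pages, and interpreter noise.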
Faults come in two severities, and the distinction matters for diagnosis. A
minor fault is satisfied from memory already at hand: a new page from the
free list, or mapping an already-resident shared library page. Cheap,
normal, the sound of a process growing. A major fault requires storage
I/O: either a file-backed page that isn’t in the page cache or, worse, a
page coming back from swap.
Major faults are milliseconds in a nanosecond world; a process with a high
major fault rate is being quietly tortured, and ps will happily show you
(the MAJFLT column) while showing %CPU as nearly idle.
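The same counters ps prints live in /proc/<pid>/stat, so a fault rate is two reads and a subtraction. This is a sketch assuming Linux and the field layout documented in proc(5), where minflt is field 10 and majflt is field 12. The function names and the one-second interval are my own.

```python
import sys
import time

def fault_counts(stat_text: str) -> tuple[int, int]:
    """Return (minflt, majflt) from the text of /proc/<pid>/stat."""
    # comm (field 2) is parenthesised and may contain spaces or ')',
    # so split on the last ')' and count the remaining fields from field 3.
    rest = stat_text.rpartition(")")[2].split()
    return int(rest[7]), int(rest[9])  # fields 10 and 12

def read_faults(pid: str = "self") -> tuple[int, int]:
    with open(f"/proc/{pid}/stat") as f:
        return fault_counts(f.read())

if __name__ == "__main__":
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    min0, maj0 = read_faults(pid)
    time.sleep(1)
    min1, maj1 = read_faults(pid)
    print(f"minor faults/s: {min1 - min0}  major faults/s: {maj1 - maj0}")
```

A sustained nonzero major fault rate on a process that looks idle in %CPU is the pattern described above.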
Linux compounds the promise-making with overcommit: by default it will grant more virtual memory than it could ever back, on the actuarial bet that processes never touch everything they ask for. Mostly the bet pays. When it doesn’t, settlement arrives via the OOM killer. We’ll get there.
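To see how much has been promised, compare two lines of /proc/meminfo: Committed_AS (address space the kernel has granted) and CommitLimit (what it would allow under strict accounting). The sketch below assumes Linux; the helper names are hypothetical. Note that CommitLimit is only enforced when vm.overcommit_memory is 2, so a ratio above 1.0 in the default heuristic mode (0) is the bet in progress, not an error.

```python
def meminfo_kb(text: str) -> dict[str, int]:
    """Parse /proc/meminfo text into {name: value}; sizes are in kB."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[name.strip()] = int(fields[0])
    return values

def commit_ratio(text: str) -> float:
    """Committed_AS divided by CommitLimit."""
    info = meminfo_kb(text)
    return info["Committed_AS"] / info["CommitLimit"]

if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        ratio = commit_ratio(f.read())
    with open("/proc/sys/vm/overcommit_memory") as f:
        mode = f.read().strip()  # 0 = heuristic, 1 = always allow, 2 = strict
    print(f"overcommit mode {mode}, committed/limit = {ratio:.2f}")
```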
Where All the Memory Went
Back to the panicked free -m. The memory isn’t missing; it’s working:
$ free -m
               total        used        free      shared  buff/cache   available
Mem:           64231       12608         801         300       50821       50613
Almost everything not used by processes is buff/cache, which is
overwhelmingly the page cache: file contents kept in RAM after reads and
before writes hit disk. This is the kernel’s single best performance
feature. Recall the scaled latencies from post one: DRAM is six minutes,
disk is months. Every file read served from page cache is a
months-to-minutes conversion, and the kernel will convert as much idle RAM
into that win as it can. Free memory serves nobody. The cache can be
reclaimed nearly instantly when processes need the pages, which is why the
column to read is available (free plus everything cheaply reclaimable),
and why this box has 50 GB of headroom, not 800 MB.
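If you would rather alert on the right number than explain it during the incident, the same figures are in /proc/meminfo. A minimal sketch, assuming Linux with MemAvailable present (kernel 3.14 or later) and assuming free's buff/cache column is Buffers + Cached + SReclaimable, which is how I understand current procps to compute it. The function names are mine.

```python
def meminfo_kb(text: str) -> dict[str, int]:
    """Parse /proc/meminfo text into {name: value}; sizes are in kB."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[name.strip()] = int(fields[0])
    return values

def headroom_mib(text: str) -> dict[str, int]:
    """The three columns worth comparing, in MiB."""
    info = meminfo_kb(text)
    cache_kb = info["Buffers"] + info["Cached"] + info.get("SReclaimable", 0)
    return {
        "free": info["MemFree"] // 1024,
        "buff/cache": cache_kb // 1024,
        "available": info["MemAvailable"] // 1024,
    }

if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        for name, mib in headroom_mib(f.read()).items():
            print(f"{name:>10}: {mib} MiB")
```

Alert on available, and treat free as trivia.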
The flip side: a “memory-bound” symptom is sometimes a cache problem in disguise. Shrink the page cache by growing process memory and your disk I/O goes up, your file system latency goes up, and the incident gets filed against the database. We’ll meet this again in the file systems post.
Reclaim: The Part You Pay Latency For
When free pages run low, the kernel starts taking them back, and the mechanism has a fast lane and a painful one.
The background path is kswapd, the page-out daemon. Woken when free
memory dips below a threshold, it walks LRU lists of inactive pages,
evicting clean page cache (free to drop; the file is still on disk),
writing back dirty pages, and paging out cold anonymous memory if swap
exists. Done well, applications never notice; kswapd skims while they
sleep.
The painful path is direct reclaim. If memory pressure outruns kswapd,
allocations start performing reclaim synchronously. Your application
thread, mid-allocation, stops and goes hunting for pages to free, possibly
waiting on disk writeback. This is memory pressure converting directly into
application latency, and it’s invisible in utilization metrics: the thread
isn’t running, it isn’t blocked on its own I/O, it’s doing the kernel’s
janitorial work. Two good flashlights here: the pgscan_kswapd counter in
/proc/vmstat (sar -B reports it as pgscank/s) shows kswapd activity
trending up before trouble, and BCC’s drsnoop traces direct reclaim
events per process with the latency each one ate. Kernels from 4.20 on also
state it plainly in /proc/pressure/memory (PSI), the share of time tasks
spent stalled on memory, which is the USE method’s saturation metric handed
to you on a plate.
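The pressure file is two lines of key=value pairs, so it takes very little to turn it into something a monitor can threshold. A sketch, assuming a kernel with PSI enabled and the documented format (a "some" line and a "full" line, each with avg10, avg60, avg300, and total). The parser name is hypothetical.

```python
def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse /proc/pressure/memory text into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *pairs = line.split()
        # avg* values are percentages of wall time; total is stall time in microseconds.
        result[kind] = {key: float(value) for key, value in (p.split("=") for p in pairs)}
    return result

if __name__ == "__main__":
    with open("/proc/pressure/memory") as f:
        psi = parse_psi(f.read())
    print(f"some avg10: {psi['some']['avg10']}%  full avg10: {psi['full']['avg10']}%")
```

"some" means at least one task was stalled on memory; "full" means all non-idle tasks were stalled at once, which is the number that should wake someone up.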
Which pages get sacrificed first, file cache or process memory, is the
swappiness tunable (vm.swappiness: 0–100, extended to 0–200 in kernel 5.8;
default 60; higher favors paging out application memory to preserve warm
file cache). That default surprises
people: the kernel will genuinely choose to page out your cold heap to keep
hot file data resident, and for throughput it’s often right.
Swap, No Swap, and the Killer
Note what “swapping” means on Linux: paging out individual cold pages, not
the ancient Unix move of evicting whole processes. Page-level swap under
moderate pressure is survivable; it’s the grace period in which an
application with a leak gets slow before it gets dead, and slow is
debuggable. You can catch it live (vmstat’s si/so columns nonzero is
the tell) and go find the growth.
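The kernel exposes the underlying swap counters as pswpin and pswpout in /proc/vmstat, both cumulative page counts. The sketch below, assuming Linux, turns two snapshots into a rate; the function names and the one-second interval are illustrative.

```python
import time

def parse_vmstat(text: str) -> dict[str, int]:
    """Parse /proc/vmstat text ("name value" per line) into a dict."""
    return {
        name: int(value)
        for name, value in (line.split() for line in text.splitlines() if line.strip())
    }

def swap_rates(before: dict[str, int], after: dict[str, int], interval_s: float) -> tuple[float, float]:
    """Pages swapped in and out per second between two snapshots."""
    swapped_in = (after["pswpin"] - before["pswpin"]) / interval_s
    swapped_out = (after["pswpout"] - before["pswpout"]) / interval_s
    return swapped_in, swapped_out

def snapshot() -> dict[str, int]:
    with open("/proc/vmstat") as f:
        return parse_vmstat(f.read())

if __name__ == "__main__":
    first = snapshot()
    time.sleep(1)
    rate_in, rate_out = swap_rates(first, snapshot(), 1.0)
    print(f"swap-in: {rate_in:.0f} pages/s  swap-out: {rate_out:.0f} pages/s")
```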
Run without swap, which is increasingly the norm, and there’s no grace
period. The leak grows until allocations can’t be satisfied, and the kernel
invokes the OOM killer, which scores processes (select_bad_process(), a
function name with no chill) and kills the loser, logging Out of memory:
Kill process (Killed process on newer kernels) to the kernel log. This is
why dmesg is step two of the
60-second checklist: an OOM kill is the answer to a surprising number of
“the service just vanished” mysteries, sitting in plain sight where nobody
looked.
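If you want the victims as data and not as grep output, a small parser will do. This sketch assumes the kernel's kill line contains "Killed process <pid> (<comm>)", which matches the kernels I have seen but is not a stable interface, and it assumes dmesg is readable by the current user. The regex and function names are my own.

```python
import re
import subprocess

# Assumed log shape: "... Killed process 1234 (name) total-vm:..."
KILL_RE = re.compile(r"Killed process (\d+) \(([^)]*)\)")

def oom_victims(log_text: str) -> list[tuple[int, str]]:
    """Return (pid, command name) for every OOM kill found in kernel log text."""
    return [(int(pid), comm) for pid, comm in KILL_RE.findall(log_text)]

if __name__ == "__main__":
    # dmesg may need elevated privileges; an unreadable log yields no matches.
    log = subprocess.run(["dmesg"], capture_output=True, text=True, check=False).stdout
    for pid, comm in oom_victims(log):
        print(f"OOM-killed: pid {pid} ({comm})")
```

A command name that itself contains a closing parenthesis will be truncated by this pattern, which is an acceptable loss for a triage helper.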
Whether no-swap is the right call is a real trade-off, not a hygiene rule. Netflix runs swapless on purpose: with a large pool of instances behind a load balancer, a fast OOM kill plus traffic redirection beats one instance slowly drowning in page-outs and dragging latency for everyone routed to it. Fail fast and loud, or degrade slow and debuggable. Pick one based on your architecture, not folklore.
One modern wrinkle that catches everyone: cgroup limits. In a container world, the memory accounting that matters is the cgroup’s, and a host with 50 GB available will still OOM-kill a container that hit its own limit. If a process died of OOM on a host that looks empty, check the container limits before doubting the kernel.
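On cgroup v2 the evidence is in three files inside the cgroup's directory: memory.max (the limit, or the literal string max), memory.current (usage in bytes), and memory.events (counters including oom_kill). A sketch, assuming cgroup v2 and that /sys/fs/cgroup is the container's own cgroup as seen from inside it; from the host you would pass the container's cgroup directory instead. The function names are hypothetical.

```python
from pathlib import Path
from typing import Optional

def parse_events(text: str) -> dict[str, int]:
    """Parse memory.events text ("name count" per line) into a dict."""
    return {
        name: int(count)
        for name, count in (line.split() for line in text.splitlines() if line.strip())
    }

def parse_limit(text: str) -> Optional[int]:
    """memory.max holds a byte count, or 'max' when no limit is set."""
    text = text.strip()
    return None if text == "max" else int(text)

def report(cgroup_dir: str = "/sys/fs/cgroup") -> str:
    base = Path(cgroup_dir)
    limit = parse_limit((base / "memory.max").read_text())
    current = int((base / "memory.current").read_text())
    kills = parse_events((base / "memory.events").read_text()).get("oom_kill", 0)
    limit_text = "unlimited" if limit is None else f"{limit / 2**20:.0f} MiB"
    return f"usage {current / 2**20:.0f} MiB of {limit_text}, oom_kill={kills}"

if __name__ == "__main__":
    try:
        print(report())
    except FileNotFoundError:
        # The root cgroup and cgroup v1 hosts do not have these files.
        print("no cgroup v2 memory files at this path")
```

A nonzero oom_kill next to a usage figure near the limit settles the question without touching host-level metrics.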
Watching It Happen
The working set for memory triage, in roughly the order I reach for the tools:
free -m # believe 'available', not 'free'
dmesg -T | grep -i 'out of' # has the killer already been here?
vmstat -SM 1 # si/so nonzero = actively paging; watch 'free' trend
cat /proc/pressure/memory # stall time: the honest saturation number
ps -eo pid,rss,maj_flt,comm --sort=-rss | head # who's resident, who's faulting
drsnoop # direct reclaim: who paid, how long
memleak -p "$(pgrep -nf myapp)" # outstanding allocations by stack, for growth (-n: newest match only)
memleak deserves a sentence: it instruments the allocation and free paths
and reports allocations that were never freed, with the stack that made
them, which converts “RSS grows 200 MB/hour” from a graph into a function
name. On a long-running process, that’s the difference between a restart
cron job and a fix.
Why This Matters
The kernel’s memory behavior looks adversarial right up until you learn its incentives. Then it looks like a colleague with strong opinions about caching. Free memory is inventory doing nothing, so it lends it out. Promises are cheaper than pages, so it overcommits. When the bill comes due, it pays with your latency (reclaim) or someone’s life (OOM), and both events are observable in advance if you watch pressure instead of utilization.
Next post: the file systems those cached pages belong to. VFS, why
open() latency is a thing, and measuring where file I/O time actually
goes before you blame the disks underneath.
References
- Gregg, B. (2020). Systems Performance: Enterprise and the Cloud, 2nd Edition. Addison-Wesley. Chapter 7.
- Gregg, B. (2019). BPF Performance Tools. Addison-Wesley. Chapter 7.
- Corbet, J. (2004). Kswapd and high-order allocations. LWN.net.