Busy Doing Nothing: CPU Utilization, Run Queues, and IPC
%CPU is the most-watched performance number in computing. It’s on every dashboard, it drives autoscaling decisions worth real money, and it’s the first thing anyone quotes in an incident. It is also quietly misleading in both directions: a CPU at 90% may be doing very little actual work, and a system at 50% may already be making your users wait. This post is about what the number really measures, and the two metrics that tell you what it hides: instructions per cycle on one side, run queue latency on the other.
This is part three of the series; the methodology and the observability machinery are in the first two posts, and both get used heavily here.
What Utilization Actually Measures
CPU utilization is the percentage of time the CPU was not idle. That sounds like “percentage of time doing work,” but the definition counts every cycle in which the CPU was running a non-idle thread, including cycles spent stalled waiting for memory. A CPU stuck on a DRAM access for the equivalent of six human-scaled minutes (the latency table from post one) is counted exactly as busy as a CPU retiring an instruction every cycle.
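To make the definition concrete, here is a minimal sketch of how a "not idle" percentage can be derived from two snapshots of the aggregate cpu line in /proc/stat. The two snapshot strings are hypothetical numbers chosen for illustration, and treating iowait as idle is a common convention that I am assuming here rather than something the tools are obliged to do.

```python
# Field order of the aggregate "cpu" line in /proc/stat (first eight counters).
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]

def parse_cpu_line(line):
    """Turn the 'cpu ...' line into a dict of cumulative tick counters."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(x) for x in parts[1:1 + len(FIELDS)])))

def utilization(before, after):
    """Percentages over the interval between two snapshots."""
    delta = {k: after[k] - before[k] for k in FIELDS}
    total = sum(delta.values())
    idle = delta["idle"] + delta["iowait"]  # assumption: iowait counts as idle
    return {
        "busy_pct": 100.0 * (total - idle) / total,
        "user_pct": 100.0 * (delta["user"] + delta["nice"]) / total,
        "system_pct": 100.0 * (delta["system"] + delta["irq"] + delta["softirq"]) / total,
    }

# Hypothetical snapshots taken some seconds apart (values are made up).
t0 = "cpu  1000 0 200 8000 100 0 0 0 0 0"
t1 = "cpu  1810 0 290 8100 100 0 0 0 0 0"

result = utilization(parse_cpu_line(t0), parse_cpu_line(t1))
print(result)
```

Note what is missing: every input is a count of ticks attributed to a mode, and nothing in those counters records whether the cycles inside a busy tick retired instructions or sat stalled.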
This isn’t a corner case. Gregg notes that on the Netflix cloud, CPU utilization is mostly memory stall cycles. Picture that for a second: the fleet-wide “busy” number on one of the largest cloud deployments in the world mostly measures waiting. The CPU shows 90%, and a large fraction of that is the processor holding its breath while the memory subsystem fetches cache lines.
The practical consequence cuts deep: if your workload is memory-stalled, buying faster CPUs mostly buys you faster waiting. More stall cycles per DRAM access, the same completed work per second. The money should have gone to faster memory, or better memory locality in the software. You cannot tell which situation you’re in from %CPU. You can tell from IPC.
IPC: Asking the Silicon What It Did
Instructions per cycle comes from the PMCs covered last post, and it’s the cheapest deep measurement you will ever take:
$ perf stat -a -- sleep 10
45,476,612,925 cycles
34,227,054,303 instructions # 0.75 insn per cycle
...
Low IPC means stalled, typically on memory access. High IPC means the CPU
is genuinely executing. Where the line sits depends on the processor, but
as rough texture: workloads down around 0.2 are badly memory-bound, and a
modern wide core can retire well above 1 when the code and data behave.
Run perf stat on a workload you believe is “CPU-bound” and you may find
that the honest description is “DRAM-bound, with the CPU as an expensive
spectator.”
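The arithmetic behind that "insn per cycle" annotation is a single division. The sketch below parses the two counter lines shown above and recomputes it; the regular expression is my own assumption about the layout of those two lines, not a general perf stat parser.

```python
import re

# The two counter lines from the perf stat output above.
PERF_OUTPUT = """\
45,476,612,925 cycles
34,227,054,303 instructions # 0.75 insn per cycle
"""

def read_counters(text):
    """Map event name -> count for lines shaped like '<number> <event> ...'."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"\s*([\d,]+)\s+([\w.:-]+)", line)
        if match:
            counters[match.group(2)] = int(match.group(1).replace(",", ""))
    return counters

def instructions_per_cycle(counters):
    return counters["instructions"] / counters["cycles"]

counters = read_counters(PERF_OUTPUT)
ipc = instructions_per_cycle(counters)
print(f"IPC = {ipc:.2f}")  # prints "IPC = 0.75"
```

Keeping the raw counters around rather than only the ratio makes it easy to compare two runs of the same workload: if cycles grew and instructions did not, the extra time went to stalls.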
So the first split: utilization high and IPC low means you tune memory access, not compute. Profile for cache misses, think about data locality and NUMA. Utilization high and IPC high means the CPU really is the bottleneck, and now profiling for hot code paths (below) will pay off. Same dashboard number, opposite remediations.
While we’re correcting the busy number, check who’s inflating it: user time
versus system time (pidstat splits them per process). A compute-heavy
workload runs near 99/1 user-to-kernel; an I/O-heavy server pushes real
time into the kernel through syscalls. A workload whose system time
suddenly grew has changed its behavior, usually toward more I/O or more
lock contention via futex calls, even if total %CPU looks flat.
The Other Direction: Waiting for a CPU
Utilization overstates work; it also understates pain, because it says nothing about threads that are ready to run but can’t get a CPU. That’s saturation. The queue in question is the run queue, and every millisecond a thread spends there is added straight onto somebody’s request latency.
The oldest saturation signal is the load average triplet from uptime,
which on Linux is quirkier than most people remember. Since 1993 Linux load
averages have measured system-wide demand, not just CPU: they include
threads in the uninterruptible sleep state (state D in ps), which
usually means waiting on disk I/O. A load of 34 on a 32-CPU box might be a
healthy CPU-bound system or a modest CPU load plus a pile of threads stuck
on a sick disk. The numbers are exponentially damped moving averages, so
they’re a trend line, not a gauge. Compare the 1-minute against the
15-minute to see direction, then move to better metrics.
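To see why a damped average is a trend line rather than a gauge, here is a floating-point sketch of the update rule. It assumes the kernel samples the count of active threads roughly every 5 seconds and decays the old value by exp(-5/window); the real kernel uses fixed-point arithmetic, so treat this as an illustration of the shape, not of the exact values uptime would print.

```python
import math

SAMPLE_INTERVAL = 5  # seconds between samples (approximate kernel behavior)

def damped_average(active_counts, window_seconds, start=0.0):
    """Exponentially damped moving average over a series of samples."""
    decay = math.exp(-SAMPLE_INTERVAL / window_seconds)
    load = start
    for active in active_counts:
        load = load * decay + active * (1.0 - decay)
    return load

# An idle system suddenly gets one always-active thread for 60 seconds:
# 12 samples of 5 seconds each.
one_minute = [1] * 12

load_1 = damped_average(one_minute, 60)     # the "1-minute" figure
load_15 = damped_average(one_minute, 900)   # the "15-minute" figure
print(f"1-min: {load_1:.2f}  15-min: {load_15:.2f}")
```

After a full minute of constant load, the 1-minute figure has covered only about 63% of the distance to the true value, and the 15-minute figure has barely moved. That gap between the two is the direction signal.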
Better metrics: vmstat’s r column counts runnable threads system-wide
(including running ones), and sustained r above your CPU count is
saturation, plainly. Newer kernels also expose pressure stall information
in /proc/pressure/cpu, which states directly what fraction of time tasks
stalled waiting for a CPU. But the measurement I actually want is the
latency itself: how long do threads wait in the queue?
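The pressure file is plain text and easy to read from a script. This sketch parses the key=value layout; the sample text uses hypothetical numbers, and whether a "full" line is present depends on the kernel version, so the parser accepts whatever lines it finds.

```python
# Hypothetical contents of /proc/pressure/cpu (values are made up).
SAMPLE = """\
some avg10=12.50 avg60=4.20 avg300=1.10 total=123456789
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
"""

def parse_pressure(text):
    """Return {'some': {'avg10': ..., ...}, ...} from PSI file contents."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *pairs = line.split()
        result[kind] = {key: float(value)
                        for key, value in (pair.split("=") for pair in pairs)}
    return result

pressure = parse_pressure(SAMPLE)
print(f"tasks stalled on CPU for {pressure['some']['avg10']}% of the last 10 s")
```

On a live system, replacing SAMPLE with the contents of /proc/pressure/cpu should work as long as the kernel was built with pressure stall information enabled. The avg fields are percentages over 10, 60, and 300 second windows; total is cumulative stall time in microseconds.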
Measuring the Queue: runqlat
runqlat from BCC instruments scheduler events and produces a histogram of
run queue latency, the time from “thread became runnable” to “thread got a
CPU.” Here’s a 48-CPU system at about 42% utilization:
# runqlat 10 1
Tracing run queue latency... Hit Ctrl-C to end.
     usecs               : count     distribution
         0 -> 1          : 3149     |                                        |
         2 -> 3          : 304613   |****************************************|
         4 -> 7          : 274541   |************************************    |
         8 -> 15         : 58576    |*******                                 |
        16 -> 31         : 15485    |**                                      |
        32 -> 63         : 24877    |***                                     |
Healthy: the mass of wakeups gets a CPU within a handful of microseconds. Now the same tool on a system where a software build was accidentally launched with 72 parallel jobs on fewer CPUs:
# runqlat 10 1
     usecs               : count     distribution
         2 -> 3          : 22087    |****************************************|
         4 -> 7          : 21245    |**************************************  |
        64 -> 127        : 7370     |*************                           |
       128 -> 255        : 13001    |***********************                 |
       256 -> 511        : 4823     |********                                |
A second mode has appeared in the hundreds of microseconds: threads queueing for hardware. This is the histogram view doing what averages can’t. That bimodal shape would have melted into one bland mean, and the mean would have looked fine. When you see queueing like this, the fixes are the boring ones: fewer runnable threads, more CPUs, or CPU limits and priorities so the latency-sensitive work jumps the queue.
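A quick way to convince yourself that single-number summaries hide the second mode is to compute them from the buckets. The sketch below uses only the rows shown in the overloaded histogram above (the output there is abbreviated, so the figures are illustrative), approximates each bucket by its midpoint, and uses a 64-microsecond "slow" cutoff that is my own choice for this example.

```python
# (low_us, high_us), count -- the rows shown in the overloaded histogram.
BUCKETS = [
    ((2, 3), 22087),
    ((4, 7), 21245),
    ((64, 127), 7370),
    ((128, 255), 13001),
    ((256, 511), 4823),
]

def summarize(buckets, slow_threshold_us=64):
    """Return (approximate mean, bucket holding the median, slow fraction)."""
    total = sum(count for _, count in buckets)
    mean = sum((lo + hi) / 2 * count for (lo, hi), count in buckets) / total

    running = 0
    median_bucket = None
    for bounds, count in buckets:
        running += count
        if running >= total / 2:
            median_bucket = bounds
            break

    slow = sum(count for (lo, _), count in buckets if lo >= slow_threshold_us)
    return mean, median_bucket, slow / total

mean_us, median_bucket, slow_fraction = summarize(BUCKETS)
print(f"mean ~{mean_us:.0f} us, median in {median_bucket}, "
      f"{slow_fraction:.0%} of wakeups waited 64 us or more")
```

The median lands in the fast mode and the mean lands in a range where almost no wakeups actually fall, yet more than a third of the wakeups shown waited in the slow mode. Neither number describes what a thread experienced; the shape does.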
The companion tool runqlen samples run queue length instead of latency. It’s cheaper and coarser, useful for confirming queueing exists before measuring how much it hurts.
Profiling: Where the Cycles Go
If IPC and runqlat say the CPU is genuinely busy executing your code, the next question is which code, and the answer is a sampled profile: grab the stack trace at a fixed rate (99 Hz is the convention, just off 100 to avoid striding in lockstep with timed activity) and count which paths show up.
# the classic
perf record -F 99 -a -g -- sleep 30 && perf report
# the BPF way: stacks counted in-kernel, summary out
profile -F 99 30
The two differ in plumbing exactly the way last post described: perf
writes every sample to a file and post-processes, while BCC’s profile
aggregates stack counts in a kernel map and emits totals. Less overhead,
and no perf.data to fill your disk.
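The reason for 99 Hz rather than 100 Hz is easy to demonstrate with a toy model. Assume a hypothetical task that wakes every 10 ms and runs for the first 1 ms of each period, so its true on-CPU share is 10%. A sampler running at exactly 100 Hz lands on the same phase of that period every time; one at 99 Hz drifts across it. Exact fractions are used so that rounding does not blur the result.

```python
from fractions import Fraction

PERIOD = Fraction(1, 100)   # hypothetical task period: 10 ms
BUSY = Fraction(1, 1000)    # task is on-CPU for the first 1 ms of each period

def sampled_busy_fraction(hz, seconds=10):
    """Fraction of timer samples that catch the task on-CPU."""
    interval = Fraction(1, hz)
    samples = hz * seconds
    hits = sum(1 for k in range(samples) if (k * interval) % PERIOD < BUSY)
    return hits / samples

lockstep = sampled_busy_fraction(100)  # sampler starts in phase with the task
offset = sampled_busy_fraction(99)

print(f"100 Hz estimate: {lockstep:.0%}")
print(f" 99 Hz estimate: {offset:.1%}")
```

In this model the 100 Hz sampler reports the task as always running, and with a different starting phase it would report it as never running. The 99 Hz sampler sweeps through every phase and lands close to the true 10%.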
Read the result as a flame graph. Rules of the road, since they’re
perpetually misread: the x-axis is alphabetical, not time, so left to
right means nothing. Width is everything: a function’s width is how often
it appeared in samples. The top edge is who was actually on-CPU, and
everything below a frame is its ancestry, the path that got it there. Find
the widest towers, look at their top edges, and you have your hot code. If
instead you’re staring at a single frame of hex addresses, you’ve met the
frame-pointer problem from last post. Go recompile with
-fno-omit-frame-pointer and come back.
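The width and top-edge rules can be checked by hand on folded stacks, the one-line-per-stack text format that flame graph tooling consumes: frames joined by semicolons from root to leaf, then a space and a sample count. The stacks and function names below are hypothetical. As a simplification, this sketch keys frames by name alone, so the same function reached by two different paths is merged, which a real flame graph would draw as separate boxes.

```python
from collections import Counter

# Hypothetical folded stacks: root;...;leaf <sample count>
FOLDED = """\
main;handle_request;parse_json 30
main;handle_request;compress;deflate 55
main;handle_request;compress 5
main;gc 10
"""

def frame_widths(folded):
    """Return (width, on_cpu): samples containing each frame, and samples
    where the frame was the leaf, i.e. actually running on the CPU."""
    width = Counter()
    on_cpu = Counter()
    for line in folded.splitlines():
        stack, count = line.rsplit(" ", 1)
        frames = stack.split(";")
        samples = int(count)
        for frame in set(frames):      # count a recursive frame once per stack
            width[frame] += samples
        on_cpu[frames[-1]] += samples  # the top edge of the tower
    return width, on_cpu

width, on_cpu = frame_widths(FOLDED)
print("widest frames:", width.most_common(3))
print("hottest leaf:", on_cpu.most_common(1))
```

Here handle_request is wide because almost everything runs beneath it, but it never appears on the top edge. The cycles are burning in deflate, which is what the "widest tower, then top edge" reading finds.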
The Half That Profiling Can’t See
A CPU profile only samples threads that are on CPU. A thread blocked on a
lock, a disk read, or a full network buffer is invisible to it, and blocked
time is frequently where the latency actually lives. The complement is
off-CPU analysis: instrument the scheduler context-switch path and record
how long threads spend off CPU and the stack that put them there. That’s
offcputime from BCC, and the output reads like a confession: this stack,
blocked in futex_wait, for this many total milliseconds.
It’s heavier than profiling (scheduler events vastly outnumber 99 Hz samples), so it’s a scalpel rather than a dashboard. But the pairing is the point: on-CPU profile for where cycles burn, off-CPU stacks for where time vanishes. Between them, essentially all of a thread’s wall-clock time is accounted for. We’ll lean on off-CPU analysis again in the file system and disk posts, because that’s where most of the blocking turns out to come from.
Why This Matters
The next time a dashboard shows a CPU at 90%, you have three questions that
take five minutes to answer. Is it executing or stalled? (perf stat, look
at IPC.) Is anything queueing behind it? (runqlat, look for the second
mode.) And if it’s real work, whose code? (profile, read the flame
graph.) Three different answers, three completely different fixes, one
dashboard number that couldn’t distinguish them.
Next post: memory. Where those stall cycles come from, what the kernel does behind your back with page cache and reclaim, and why the OOM killer is never as random as it feels.
References
- Gregg, B. (2020). Systems Performance: Enterprise and the Cloud, 2nd Edition. Addison-Wesley. Chapter 6.
- Gregg, B. (2019). BPF Performance Tools. Addison-Wesley. Chapter 6.
- Gregg, B. (2017). CPU Utilization Is Wrong. brendangregg.com.