%CPU is the most-watched performance number in computing. It’s on every dashboard, it drives autoscaling decisions worth real money, and it’s the first thing anyone quotes in an incident. It is also quietly misleading in both directions: a CPU at 90% may be doing very little actual work, and a system at 50% may already be making your users wait. This post is about what the number really measures, and the two metrics that tell you what it hides: instructions per cycle on one side, run queue latency on the other.

This is part three of the series; the methodology and the observability machinery are in the first two posts, and both get used heavily here.

What Utilization Actually Measures

CPU utilization is the percentage of time the CPU was not idle. That sounds like “percentage of time doing work,” but “not idle” covers every clock cycle in which a thread held the CPU, including cycles spent stalled waiting for memory. A CPU stuck on a DRAM access for the equivalent of six human-scaled minutes (the latency table from post one) is counted exactly as busy as a CPU retiring an instruction every cycle.

This isn’t a corner case. Gregg notes that on the Netflix cloud, CPU utilization is mostly memory stall cycles. Picture that for a second: the fleet-wide “busy” number on one of the largest cloud deployments in the world mostly measures waiting. The CPU shows 90%, and a large fraction of that is the processor holding its breath while the memory subsystem fetches cache lines.

The practical consequence cuts deep: if your workload is memory-stalled, buying faster CPUs mostly buys you faster waiting. More stall cycles per DRAM access, the same completed work per second. The money should have gone to faster memory, or better memory locality in the software. You cannot tell which situation you’re in from %CPU. You can tell from IPC.
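A back-of-the-envelope model makes “faster waiting” concrete. Every number here is an assumption picked for illustration (a 100 ns DRAM access, 100 cycles of compute and one cache miss per unit of work), not a measurement:

```python
def stall_cycles_per_access(freq_mhz, dram_latency_ns):
    """Cycles the core spends stalled during one DRAM access."""
    return freq_mhz * dram_latency_ns / 1000


def ns_per_work_item(freq_mhz, compute_cycles, misses, dram_latency_ns):
    """Time for one unit of work: compute scales with the clock, DRAM does not."""
    compute_ns = compute_cycles * 1000 / freq_mhz
    return compute_ns + misses * dram_latency_ns


# Hypothetical workload: 100 cycles of compute plus one DRAM miss per item.
for mhz in (3000, 4000):
    stall = stall_cycles_per_access(mhz, 100)
    total = ns_per_work_item(mhz, 100, 1, 100)
    print(f"{mhz} MHz: {stall:.0f} stall cycles per miss, {total:.1f} ns per item")
```

In this toy model a clock that is a third faster finishes each item only about 6% sooner, because the 100 ns of waiting does not shrink. It just costs more cycles.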

IPC: Asking the Silicon What It Did

Instructions per cycle comes from the PMCs covered last post, and it’s the cheapest deep measurement you will ever take:

$ perf stat -a -- sleep 10

     45,476,612,925      cycles
     34,227,054,303      instructions   #  0.75 insn per cycle
     ...

Low IPC means stalled, typically on memory access. High IPC means the CPU is genuinely executing. Where the line sits depends on the processor, but as rough texture: workloads down around 0.2 are badly memory-bound, and a modern wide core can retire well above 1 when the code and data behave. Run perf stat on a workload you believe is “CPU-bound” and you may find that the honest description is “DRAM-bound, with the CPU as an expensive spectator.”
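If you want IPC in a script rather than by eye, the arithmetic is a single division. This is a minimal sketch that assumes perf stat’s default text layout with comma digit grouping, as in the output above. The helper name is made up for this post, and the layout can vary with perf version and locale:

```python
import re


def ipc_from_perf_stat(text):
    """Return instructions / cycles parsed from perf stat text output."""
    counts = {}
    for line in text.splitlines():
        # Matches lines such as "  45,476,612,925      cycles"
        m = re.match(r"\s*([\d,]+)\s+(cycles|instructions)\b", line)
        if m:
            counts[m.group(2)] = int(m.group(1).replace(",", ""))
    return counts["instructions"] / counts["cycles"]


# The two counter lines from the perf stat run shown earlier.
sample = """
     45,476,612,925      cycles
     34,227,054,303      instructions   #  0.75 insn per cycle
"""
print(f"IPC = {ipc_from_perf_stat(sample):.2f}")
```

For scripting against live systems, perf stat’s -x option (field-separated output) is likely a sturdier input than scraping the human-readable form.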

So the first split: utilization high and IPC low means you tune memory access, not compute. Profile for cache misses, think about data locality and NUMA. Utilization high and IPC high means the CPU really is the bottleneck, and now profiling for hot code paths (below) will pay off. Same dashboard number, opposite remediations.
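Written as code, the split is two comparisons. The thresholds are assumptions, not constants: 0.8 for “busy” is arbitrary, and 1.0 for IPC is the rough rule of thumb from Gregg’s “CPU Utilization Is Wrong” post, which shifts with the processor. Treat this as a sketch of the decision, with a hypothetical function name:

```python
def first_split(utilization, ipc, busy_threshold=0.8, ipc_threshold=1.0):
    """Classify a CPU reading; both thresholds are illustrative defaults."""
    if utilization < busy_threshold:
        return "not busy"
    if ipc < ipc_threshold:
        return "busy but stalled: tune memory access"
    return "busy and executing: profile hot code paths"


# Same dashboard number, opposite remediations.
print(first_split(0.90, 0.3))
print(first_split(0.90, 1.8))
```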

While we’re correcting the busy number, check who’s inflating it: user time versus system time (pidstat splits them per process). A compute-heavy workload runs near 99/1 user-to-kernel; an I/O-heavy server pushes real time into the kernel through syscalls. A workload whose system time suddenly grew has changed its behavior, usually toward more I/O or more lock contention via futex calls, even if total %CPU looks flat.
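The same split is available system-wide from the aggregate cpu line in /proc/stat, which reports cumulative ticks in the order user, nice, system, idle, iowait, irq, softirq, steal. Below is a sketch that diffs two snapshots. The snapshot values are invented, and folding irq and softirq into “system” is my choice for this example, not a standard:

```python
NAMES = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def cpu_times(stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counts."""
    fields = stat_text.splitlines()[0].split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate cpu line first")
    return dict(zip(NAMES, map(int, fields[1:9])))


def user_system_split(before, after):
    """Return (user %, system %) of all CPU time between two snapshots."""
    delta = {k: after[k] - before[k] for k in NAMES}
    total = sum(delta.values())
    user = delta["user"] + delta["nice"]
    system = delta["system"] + delta["irq"] + delta["softirq"]
    return 100 * user / total, 100 * system / total


# Hypothetical snapshots taken some seconds apart.
t0 = cpu_times("cpu  1000 0 100 8000 50 0 10 0 0 0")
t1 = cpu_times("cpu  1700 0 400 8900 50 0 10 0 0 0")
user_pct, system_pct = user_system_split(t0, t1)
print(f"user {user_pct:.1f}%  system {system_pct:.1f}%")
```

On a live Linux box you would pass open("/proc/stat").read() twice with a sleep in between. Here 30% of the busy time is in the kernel, which is the kind of ratio worth a second look on a workload that is supposed to be compute-heavy.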

The Other Direction: Waiting for a CPU

Utilization overstates work; it also understates pain, because it says nothing about threads that are ready to run and can’t. That’s saturation. The queue in question is the run queue, and every millisecond a thread spends there is added straight onto somebody’s request latency.

The oldest saturation signal is the load average triplet from uptime, which on Linux is quirkier than most people remember. Since 1993 Linux load averages have measured system-wide demand, not just CPU: they include threads in the uninterruptible sleep state (state D in ps), which usually means waiting on disk I/O. A load of 34 on a 32-CPU box might be a healthy CPU-bound system or a modest CPU load plus a pile of threads stuck on a sick disk. The numbers are exponentially damped moving averages, so they’re a trend line, not a gauge. Compare the 1-minute against the 15-minute to see direction, then move to better metrics.
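To see why a damped average is a trend line rather than a gauge, here is a floating-point model of the update rule. It assumes a sample every 5 seconds and time constants of 60 and 900 seconds. The kernel does the equivalent in fixed-point arithmetic, so this is an approximation of the behavior, not its code:

```python
import math


def damped_average(samples, window_s, interval_s=5):
    """Exponentially damped moving average, one update per sample."""
    decay = math.exp(-interval_s / window_s)
    load = 0.0
    history = []
    for n in samples:
        load = load * decay + n * (1 - decay)
        history.append(load)
    return history


# Demand jumps from 0 to 32 runnable threads and stays there for 60 seconds.
one_min = damped_average([32] * 12, window_s=60)
fifteen_min = damped_average([32] * 12, window_s=900)
print(f"after 60 s: 1-min = {one_min[-1]:.1f}, 15-min = {fifteen_min[-1]:.1f}")
```

After a full minute of constant demand the “1-minute” figure has covered only about 63% of the distance to 32, and the 15-minute figure has barely moved. That lag is why comparing the two tells you direction and not much else.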

Better metrics: vmstat’s r column counts runnable threads system-wide (including running ones), and sustained r above your CPU count is saturation, plainly. Kernels from 4.20 onward, when built with PSI support, also expose pressure stall information in /proc/pressure/cpu, which states directly what fraction of time tasks stalled waiting for a CPU. But the measurement I actually want is the latency itself: how long do threads wait in the queue?
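Before getting to latency, note that the PSI file is simple enough to read without a tool. Each line is a kind (“some”, and on newer kernels “full”) followed by key=value pairs: avg10, avg60 and avg300 are percentages over those windows, and total is cumulative stall time in microseconds. A small parser, run here against an invented sample line rather than a real system:

```python
def parse_psi(text):
    """Parse /proc/pressure/* content into {kind: {field: value}}."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]
        result[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return result


# Hypothetical contents of /proc/pressure/cpu.
sample = "some avg10=12.50 avg60=4.10 avg300=1.02 total=123456789\n"
psi = parse_psi(sample)
print(f"tasks waited for a CPU {psi['some']['avg10']}% of the last 10 s")
```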

Measuring the Queue: runqlat

runqlat from BCC instruments scheduler events and produces a histogram of run queue latency, the time from “thread became runnable” to “thread got a CPU.” Here’s a 48-CPU system at about 42% utilization:

# runqlat 10 1
Tracing run queue latency... Hit Ctrl-C to end.

     usecs               : count     distribution
         0 -> 1          : 3149     |                                        |
         2 -> 3          : 304613   |****************************************|
         4 -> 7          : 274541   |************************************    |
         8 -> 15         : 58576    |*******                                 |
        16 -> 31         : 15485    |**                                      |
        32 -> 63         : 24877    |***                                     |

Healthy: the mass of wakeups gets a CPU within a handful of microseconds. Now the same tool on a system where a software build was accidentally launched with 72 parallel jobs on fewer CPUs:

# runqlat 10 1

     usecs               : count     distribution
         2 -> 3          : 22087    |****************************************|
         4 -> 7          : 21245    |**************************************  |
        64 -> 127        : 7370     |*************                           |
       128 -> 255        : 13001    |***********************                 |
       256 -> 511        : 4823     |********                                |

A second mode has appeared in the hundreds of microseconds: threads queueing for hardware. This is the histogram view doing what averages can’t. That bimodal shape would have melted into one bland mean, and the mean would have looked fine. When you see queueing like this, the fixes are the boring ones: fewer runnable threads, more CPUs, or CPU limits and priorities so the latency-sensitive work jumps the queue.
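You can check the “bland mean” claim against the histogram itself. This sketch uses only the rows shown above and treats each bucket’s midpoint as its latency, which is an approximation:

```python
# (low_us, high_us, count) rows from the overloaded-system histogram above.
rows = [
    (2, 3, 22087),
    (4, 7, 21245),
    (64, 127, 7370),
    (128, 255, 13001),
    (256, 511, 4823),
]

total = sum(count for _, _, count in rows)

# Approximate mean, assuming every wakeup sits at its bucket midpoint.
approx_mean_us = sum((lo + hi) / 2 * count for lo, hi, count in rows) / total

# Share of wakeups that landed in the slow mode (64 us and up).
queued_share = sum(count for lo, _, count in rows if lo >= 64) / total

print(f"approx mean: {approx_mean_us:.0f} us, queued share: {queued_share:.0%}")
```

The mean comes out around 76 µs: a single unremarkable figure that reveals neither the fast majority nor the fact that more than a third of wakeups waited 64 µs or longer.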

The companion tool runqlen samples run queue length instead of latency. It’s cheaper and coarser, useful for confirming queueing exists before measuring how much it hurts.

Profiling: Where the Cycles Go

If IPC and runqlat say the CPU is genuinely busy executing your code, the next question is which code, and the answer is a sampled profile: grab the stack trace at a fixed rate (99 Hz is the convention, just off 100 to avoid striding in lockstep with timed activity) and count which paths show up.

# the classic
perf record -F 99 -a -g -- sleep 30 && perf report

# the BPF way: stacks counted in-kernel, summary out
profile -F 99 30

The two differ in plumbing exactly the way last post described: perf writes every sample to a file and post-processes, while BCC’s profile aggregates stack counts in a kernel map and emits totals. Less overhead, and no perf.data to fill your disk.
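The 99 Hz convention is easy to justify with a simulation. Take a hypothetical task that wakes on a 100 Hz timer and runs for the first 10% of every period. Time is counted in units of 1/99000 of a second so both sampling periods are whole numbers:

```python
UNITS_PER_SECOND = 99_000  # divisible by both 99 and 100


def sampled_busy_fraction(sample_hz, seconds=30, phase=0):
    """Fraction of samples that catch the periodic task on-CPU."""
    timer_period = UNITS_PER_SECOND // 100   # task wakes at 100 Hz
    busy_len = timer_period // 10            # and runs 10% of each period
    sample_period = UNITS_PER_SECOND // sample_hz
    n = sample_hz * seconds
    hits = sum(
        1 for k in range(n)
        if (phase + k * sample_period) % timer_period < busy_len
    )
    return hits / n


print("100 Hz, in phase:    ", sampled_busy_fraction(100))
print("100 Hz, out of phase:", sampled_busy_fraction(100, phase=500))
print(" 99 Hz:              ", round(sampled_busy_fraction(99), 3))
```

At 100 Hz the sampler hits the same point of every period, so it reports the task as always running or never running depending on where it started. At 99 Hz the sample point drifts through the period and the estimate lands close to the true 10%.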

Read the result as a flame graph. Rules of the road, since they’re perpetually misread: the x-axis is alphabetical, not time, so left to right means nothing. Width is everything: a function’s width is how often it appeared in samples. The top edge is who was actually on-CPU, and everything below a frame is its ancestry, the path that got it there. Find the widest towers, look at their top edges, and you have your hot code. If instead you’re staring at a single frame of hex addresses, you’ve met the frame-pointer problem from last post. Go recompile with -fno-omit-frame-pointer and come back.
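The width rule is just counting. Given folded stacks (one “frame;frame;frame count” line per unique stack, the format flame graph tooling consumes), a frame’s width is the number of samples whose stack passes through it, and the top edge is the samples where it was the last frame. A sketch with invented function names and counts:

```python
from collections import Counter


def frame_widths(folded_lines):
    """Return (inclusive, on_cpu) sample counts from folded stack lines."""
    inclusive = Counter()  # keyed by path from the root, like a flame graph box
    on_cpu = Counter()     # keyed by the leaf frame that was actually running
    for line in folded_lines:
        stack, count = line.rsplit(" ", 1)
        frames = stack.split(";")
        n = int(count)
        for depth in range(1, len(frames) + 1):
            inclusive[";".join(frames[:depth])] += n
        on_cpu[frames[-1]] += n
    return inclusive, on_cpu


# Hypothetical profile: 100 samples in total.
folded = [
    "main;handle_request;parse_json 40",
    "main;handle_request;compress 25",
    "main;handle_request 5",
    "main;gc 30",
]
inclusive, on_cpu = frame_widths(folded)
for path in sorted(inclusive):  # alphabetical, like the x-axis
    print(f"{inclusive[path]:4d}  {path}")
print("hottest leaf:", on_cpu.most_common(1)[0])
```

Here handle_request is 70 samples wide but was itself on-CPU for only 5 of them. The time is in its children, which is exactly what reading the top edge tells you.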

The Half That Profiling Can’t See

A CPU profile only samples threads that are on CPU. A thread blocked on a lock, a disk read, or a full network buffer is invisible to it, and blocked time is frequently where the latency actually lives. The complement is off-CPU analysis: instrument the scheduler context-switch path and record how long threads spend off CPU and the stack that put them there. That’s offcputime from BCC, and the output reads like a confession: this stack, blocked in futex_wait, for this many total milliseconds.

It’s heavier than profiling (scheduler events vastly outnumber 99 Hz samples), so it’s a scalpel rather than a dashboard. But the pairing is the point: on-CPU profile for where cycles burn, off-CPU stacks for where time vanishes. Between them, all wall-clock time is accounted for. We’ll lean on off-CPU analysis again in the file system and disk posts, because that’s where most of the blocking turns out to come from.
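The accounting can be sketched in a few lines. Each on-CPU sample stands for roughly one sampling interval of CPU time, and offcputime reports blocked time in microseconds. The thread and its numbers below are hypothetical:

```python
def wall_clock_breakdown(on_cpu_samples, sample_hz, off_cpu_us):
    """Estimate (on-CPU seconds, off-CPU seconds, on-CPU share) for a thread."""
    on_cpu_s = on_cpu_samples / sample_hz   # each sample ~ 1/sample_hz seconds
    off_cpu_s = off_cpu_us / 1_000_000
    return on_cpu_s, off_cpu_s, on_cpu_s / (on_cpu_s + off_cpu_s)


# Hypothetical thread over a 30 s window: 594 samples at 99 Hz, 24 s blocked.
on_s, off_s, share = wall_clock_breakdown(594, 99, 24_000_000)
print(f"on-CPU {on_s:.1f} s, off-CPU {off_s:.1f} s, on-CPU share {share:.0%}")
```

A thread like this one spends 80% of its life blocked, so a CPU profile alone would be describing the smaller part of its latency.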

Why This Matters

The next time a dashboard shows a CPU at 90%, you have three questions that take five minutes to answer. Is it executing or stalled? (perf stat, look at IPC.) Is anything queueing behind it? (runqlat, look for the second mode.) And if it’s real work, whose code? (profile, read the flame graph.) Three different answers, three completely different fixes, one dashboard number that couldn’t distinguish them.

Next post: memory. Where those stall cycles come from, what the kernel does behind your back with page cache and reclaim, and why the OOM killer is never as random as it feels.

References

  1. Gregg, B. (2020). Systems Performance: Enterprise and the Cloud, 2nd Edition. Addison-Wesley. Chapter 6.
  2. Gregg, B. (2019). BPF Performance Tools. Addison-Wesley. Chapter 6.
  3. Gregg, B. (2017). CPU Utilization Is Wrong. brendangregg.com.