Where the Light Is Best: Why Performance Analysis Needs a Method

A service is slow. Within ten minutes there are four people in the incident channel and four theories: it’s the database, it’s the network, it’s that deploy from yesterday, it’s “probably GC.” Someone has top open on a shared screen and everyone is squinting at it as if the answer will scroll past. Nobody can say what slow means in numbers, and nobody has a plan for what to look at after top.¹

I want to spend this series working through Linux performance properly, from the kernel up, leaning on the two books that taught me most of what I know about it: Brendan Gregg’s Systems Performance² and BPF Performance Tools.³ Later posts get into CPUs, memory, file systems, disks, and the network, with a lot of time spent in tracing tools. But the tools are the easy part. The thing that actually separates a good performance engineer from someone flailing in an incident channel is method, so that’s where we start.

The Streetlight Problem

There’s an old joke Gregg uses to name the most common failure mode. A drunk is searching the ground under a streetlight. A passing officer helps him look for his keys, finds nothing, and asks whether he’s sure he lost them here. “No,” says the drunk, “but this is where the light is best.”

That’s top. Not because top is a bad tool, but because the usual reason it’s open is that it’s the tool the person knows, not because anything about the problem points to it. Gregg calls this the streetlight anti-method: analysis by walking through whichever observability tools happen to be familiar, hoping something obvious shows up. Sometimes something does show up, and that’s the trap, because a busy system always has something that looks suspicious. You find a problem, fix it, and the original issue is still there, because what you found was never quantified against the complaint.

Two relatives of the streetlight deserve a name too, because you will watch them happen in real incidents:

The random change anti-method: guess where the problem is, flip a tunable, measure, flip it back, measure, keep whatever seemed faster. This eventually produces a config full of settings nobody understands, some of which were only ever workarounds for bugs that have since been fixed, and at least one of which will detonate under peak load.

The blame-someone-else anti-method: pick a component you’re not responsible for, hypothesize the issue lives there, and route the ticket. “Maybe it’s the network. Can you check with the network team about dropped packets or something?” The tell is a hypothesis with no data leading to it. When you’re on the receiving end, ask for the output of whatever tool produced the suspicion. There usually isn’t one.

None of these are strawmen. They’re the default. A methodology is what you do instead, and the rest of this post covers the three I reach for, in the order I reach for them.

Start by Asking, Not Measuring

The first methodology doesn’t involve logging in to anything. Gregg calls it the problem statement, and it’s just a disciplined version of asking what’s actually wrong:

1. What makes you think there is a performance problem?
2. Has this system ever performed well?
3. What changed recently? Software, hardware, load?
4. Can the problem be expressed in terms of latency or runtime?
5. Does it affect other people or applications, or just you?
6. What is the environment? Software, hardware, versions, configuration?

It feels almost too obvious to write down, which is exactly why nobody does it. Question 3 alone resolves a respectable fraction of issues (“well, we did ship a new connection pool yesterday”). Question 1 sometimes dissolves the problem entirely: the graph that triggered the page turns out to measure something other than what its title says. Gregg mentions having closed issues over the phone with these six questions and no server access. I believe him; I’ve done it over Slack.

Question 4 matters more than it looks, and it’s worth dwelling on why.

Latency Is the Currency

Counters tell you how much: packets, IOPS, context switches. None of that says whether anyone is waiting. Latency does. It’s the time an operation took, which makes it the one metric that converts directly into user pain and the one unit in which every component of the stack can be compared. You can’t compare 30,000 IOPS against 40% CPU. You can compare 2 ms spent in disk I/O against 11 ms spent waiting for a CPU.

What makes systems latency hard to reason about is the range. Gregg has a table in Systems Performance that scales everything to human time: if one 3.5 GHz CPU cycle (0.3 ns) took one second, then a main memory access takes 6 minutes. An SSD read takes 9 hours to 4 days. A rotational disk read takes 1 to 12 months. A packet from San Francisco to the UK, 8 years. A TCP timer-based retransmit, one to three centuries.

That table is the whole game in miniature. When an application stalls on a disk read, on this scale it’s an office worker stepping out for most of a year while the CPU sits there executing nothing millions of times over. It’s also why “the CPU is only at 40%” tells you nothing about latency: the interesting question is never how busy things are, it’s what the request is waiting on. Hold the scale of that table in your head and a lot of performance intuition comes free. Caching stops being an optimization and starts being the only reason computers feel fast at all.
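
The scaling behind that table is a single division, and it can help to redo it with your own numbers. The sketch below is not Gregg's table: the latencies are round figures I have assumed for illustration, and the helper names scaled_seconds and humanize are made up for this example.

```python
# Rescale real latencies so that one CPU cycle lasts one "human" second.
CYCLE_NS = 0.3  # one cycle at roughly 3.5 GHz, as quoted in the text

# Assumed round-number latencies in nanoseconds (illustrative, not measured).
LATENCIES_NS = {
    "main memory access": 100,
    "SSD read (fast end)": 10_000,
    "rotational disk read (slow end)": 10_000_000,
    "TCP timer-based retransmit (slow end)": 3_000_000_000,
}

def scaled_seconds(latency_ns, cycle_ns=CYCLE_NS):
    """Seconds the operation would take if one CPU cycle took one second."""
    return latency_ns / cycle_ns

def humanize(seconds):
    """Render a duration in the largest unit that keeps the value at or above 1."""
    units = [("years", 365 * 86400), ("days", 86400), ("hours", 3600), ("minutes", 60)]
    for name, size in units:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:.1f} seconds"

for name, ns in LATENCIES_NS.items():
    print(f"{name:40s} {humanize(scaled_seconds(ns))}")
```

With these assumed inputs the memory access lands between five and six minutes and the slow retransmit at a little over three centuries, which is the same shape as the figures quoted above.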

So: express the problem as latency (question 4), and you get something you can decompose. The request took 100 ms. Where did they go? 10 ms on CPU, 70 ms waiting on disk I/O, 20 ms waiting on a lock. Each of those can be decomposed again. That’s latency analysis, and most of the tracing tools later in this series exist to do exactly this decomposition without guessing.
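
As a rough sketch of what that decomposition looks like once you have the numbers, here is the 100 ms request from the paragraph above broken into shares. The function name breakdown and the component labels are hypothetical; in practice the per-component times would come from tracing, not from a hand-written dictionary.

```python
def breakdown(total_ms, components):
    """Return rows of (name, ms, share of total) sorted largest first,
    plus whatever part of the total the components do not account for."""
    rows = [(name, ms, ms / total_ms) for name, ms in components.items()]
    rows.sort(key=lambda row: row[1], reverse=True)
    unaccounted = total_ms - sum(components.values())
    return rows, unaccounted

# The example request from the text: 100 ms in total.
request = {"on-CPU": 10, "disk I/O wait": 70, "lock wait": 20}
rows, unaccounted = breakdown(100, request)

for name, ms, share in rows:
    print(f"{name:15s} {ms:5.0f} ms  {share:4.0%}")
print(f"{'unaccounted':15s} {unaccounted:5.0f} ms")
```

The unaccounted line is the useful part of the design: if the components do not sum to the measured total, the remainder is time you have not explained yet, and that is where to decompose next.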

For Every Resource: Utilization, Saturation, Errors

The problem statement works top-down from the complaint. The USE method works bottom-up from the hardware, and it’s the thing to run early when the complaint is vague (“everything is slow”) or when you suspect a systemic bottleneck. The whole method fits in one sentence:

For every resource, check utilization, saturation, and errors.

The terms are precise. Utilization is the percentage of time the resource was busy servicing work or, for capacity resources such as memory, the proportion of the capacity in use. Saturation is the degree to which work is queued because the resource can’t accept more: run queue length, swap activity, device queue depth. Errors are just error event counts.

The order you check them is part of the method. Errors first, because they’re quick to rule out, objectively bad, and good at hiding. An operation that fails and silently retries shows up as latency, not as a log line anyone reads. Saturation second, because any sustained saturation is a problem: it means a queue is forming, and queues are where latency lives. Utilization last, because it needs the most interpretation. 60% busy averaged over a minute can conceal bursts of 100%.
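
To make the point about averages concrete, here is a small sketch with invented per-second samples: a resource that is pinned for part of a minute and idle for the rest. The sample data and the function name utilization_summary are assumptions for illustration only.

```python
def utilization_summary(samples):
    """Summarize per-second utilization percentages for one resource.

    Returns (average, peak, number of samples at or above 100%).
    """
    average = sum(samples) / len(samples)
    peak = max(samples)
    pinned = sum(1 for s in samples if s >= 100)
    return average, peak, pinned

# Hypothetical minute: 36 seconds pinned at 100%, then 24 seconds idle.
minute = [100] * 36 + [0] * 24
average, peak, pinned_seconds = utilization_summary(minute)
print(f"average {average:.0f}%, peak {peak}%, seconds at 100%: {pinned_seconds}")
```

The one-minute average comes out at 60%, yet for more than half the minute the resource had no headroom at all. Shorter sampling intervals, or a saturation metric alongside, are what expose it.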

What I actually like most about the USE method is the discipline it imposes before any tools come out. You enumerate the resources first (CPUs, memory, storage devices, network interfaces, the interconnects and buses people forget), then write down the three questions for each, and only then go hunting for tools that answer them. Iterating over resources instead of iterating over tools is the inversion that kills the streetlight problem. And when you can’t find a tool to answer one of the questions, you’ve still gained something: a known-unknown, written down, instead of a blind spot you don’t know you have. A busy CPU is easy to see. Nobody’s dashboard shows you the question they never asked.
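
One way to hold yourself to that discipline is to generate the checklist before opening a terminal. The sketch below is only an illustration: the resource list comes from the paragraph above, the tool mapping is a hypothetical starting point built from commands used later in this post, and anything without a tool is reported as a known-unknown.

```python
from itertools import product

RESOURCES = ["CPUs", "memory", "storage devices", "network interfaces",
             "interconnects and buses"]
METRICS = ["errors", "saturation", "utilization"]  # in checking order

# Hypothetical starting map from (resource, metric) to a tool; fill in your own.
KNOWN_TOOLS = {
    ("CPUs", "saturation"): "vmstat 1 (r column)",
    ("CPUs", "utilization"): "mpstat -P ALL 1",
    ("storage devices", "utilization"): "iostat -sxz 1 (%util)",
    ("network interfaces", "utilization"): "sar -n DEV 1",
}

def use_checklist(resources, metrics, tools):
    """Yield one row per resource/metric pair; a tool of None is a known-unknown."""
    for resource, metric in product(resources, metrics):
        yield resource, metric, tools.get((resource, metric))

checklist = list(use_checklist(RESOURCES, METRICS, KNOWN_TOOLS))
known_unknowns = [(r, m) for r, m, tool in checklist if tool is None]

for resource, metric, tool in checklist:
    print(f"{resource:25s} {metric:12s} {tool or '?? (known-unknown)'}")
```

The loop is over resources and metrics, never over tools, so a gap in your tooling shows up as a row with a question mark instead of silently disappearing.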

The USE method finds the bottlenecks that are there; it doesn’t promise the first one you find is the one behind today’s complaint. A real system can have several problems at once. You find one, quantify it against the problem statement, and either it explains the numbers or you keep iterating.

The First Sixty Seconds

Methodology sounds slow, and incidents aren’t. So here is the compromise, straight from the book and from Netflix practice before that: ten commands, about a minute, nothing but tools already installed on any reasonable Linux box. It’s an abbreviated USE pass over the major resources.

uptime                    # load averages: rising or falling? compare 1/5/15 min
dmesg -T | tail           # kernel errors, OOM kills. errors first, remember
vmstat -SM 1              # run queue length, free memory, swapping, system-wide CPU
mpstat -P ALL 1           # per-CPU balance: one hot CPU = single-threaded bottleneck
pidstat 1                 # which processes, user vs system time
iostat -sxz 1             # disk: IOPS, throughput, await, %util
free -m                   # memory, including how much is really page cache
sar -n DEV 1              # network device throughput, packets/s
sar -n TCP,ETCP 1         # connection rates and, critically, retransmits
top                       # tie it together, catch anything the others missed

A few of these repay attention to detail. The three load averages in uptime are an instant trend line: if the 1-minute number is well below the 15-minute number, the event may already be over and you’re doing archaeology. dmesg is second on purpose. An OOM kill explains a lot of mysteries in one line, and almost nobody checks. In vmstat, the r column is the saturation metric for CPUs: runnable threads, counting the ones executing. When it exceeds the CPU count, threads are queueing. mpstat -P ALL catches the classic case the system-wide average hides, which is one CPU pinned at 100% by a single-threaded process while fifteen others idle, reported as “6% CPU usage.” And the ETCP retransmit counters are the fastest cheap signal that the problem might genuinely be the network, with data this time, before you go talk to the network team.
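
The arithmetic behind three of those readings is simple enough to sketch. The numbers below are invented to match the examples in the paragraph above, and the function names are hypothetical helpers, not part of any tool.

```python
def cpu_saturated(runnable, cpu_count):
    """vmstat's r column counts running plus waiting threads,
    so a value above the CPU count means threads are queueing."""
    return runnable > cpu_count

def system_wide_average(per_cpu_busy):
    """The single figure a system-wide view would report for these CPUs."""
    return sum(per_cpu_busy) / len(per_cpu_busy)

def hottest_cpu(per_cpu_busy):
    """Index and busy percentage of the busiest CPU."""
    busiest = max(range(len(per_cpu_busy)), key=lambda i: per_cpu_busy[i])
    return busiest, per_cpu_busy[busiest]

def load_trend(one_min, fifteen_min):
    """Crude trend from the first and last of uptime's three load averages."""
    if one_min < fifteen_min:
        return "falling"
    if one_min > fifteen_min:
        return "rising"
    return "steady"

# One CPU pinned by a single-threaded process, fifteen idle.
per_cpu = [100.0] + [0.0] * 15
print(f"system-wide: {system_wide_average(per_cpu):.2f}% busy")
print("hottest CPU:", hottest_cpu(per_cpu))
print("CPU saturated:", cpu_saturated(runnable=20, cpu_count=16))
print("load trend:", load_trend(one_min=1.2, fifteen_min=8.0))
```

The sixteen-CPU case averages out to 6.25% busy, which is where the "6% CPU usage" reading comes from even though one CPU has no headroom left.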

Sixty seconds, and you’ve either found a smoking gun or, just as usefully, ruled out entire subsystems and earned the right to slow down and do real analysis on what’s left.

Why This Matters

Tools age; the method doesn’t. vmstat columns have looked the same for decades, the BPF tools we’ll use later didn’t exist ten years ago, and whatever replaces them will answer the same three questions about the same short list of resources. Learn the questions and every new tool is just a better flashlight, pointed this time at where you actually dropped the keys.

Next post: how observability on Linux actually works under the tools. /proc and counters, tracepoints, kprobes, uprobes, PMCs, and where perf, Ftrace, and BPF fit. That’s the foundation the rest of the series builds on.

References

1. Gregg, B. (2015). Linux Performance Analysis in 60,000 Milliseconds. Netflix Technology Blog.
2. Gregg, B. (2020). Systems Performance: Enterprise and the Cloud, 2nd Edition. Addison-Wesley. Chapters 1–2.
3. Gregg, B. (2019). BPF Performance Tools. Addison-Wesley. Chapter 3.