Better Flashlights: How Linux Observability Works Under the Tools
Last post ended with a claim: learn the questions and every new tool is just
a better flashlight. This one is about the flashlights. Not the tools
themselves yet, but the machinery underneath them, because there is less of
it than the tool landscape suggests. top, perf, Ftrace, bpftrace, every
dashboard your monitoring vendor sells you: all of them sit on the same
small set of kernel event sources. Once you know what those sources are and
what each one costs, tools stop being magic and start being interchangeable.
A useful way to carve up the space: every observability tool is answering one of three questions. How much? That’s counters. Where is the time going? That’s profiling. What exactly happened, and how long did it take? That’s tracing. The sources below map onto those questions.
Counters: The Statistics You Already Paid For
The kernel keeps counters for everything it does (packets, interrupts,
context switches, page faults) and it keeps them whether or not anyone is
looking. They’re maintained unconditionally, so reading them adds
essentially nothing to system load. Gregg calls them “free.” When vmstat
or iostat or top prints a number, it almost always came from one of two
pseudo-filesystems:
$ cat /proc/loadavg
0.52 0.58 0.59 2/1029 318483
$ grep ctxt /proc/stat
ctxt 6571418641
/proc holds per-process directories (/proc/PID/stat, /proc/PID/smaps)
plus system-wide files like /proc/meminfo and /proc/diskstats. /sys
grew out of it for device and kernel state, and netlink sockets serve the
network side; ss and ip speak netlink rather than parsing text. /proc and
/sys are plain text on purpose: when a tool lies to you, or you suspect it might
be, you can cat the file it read and check. Tracing a tool to see which
/proc files it opens (strace -e openat top) is a genuinely useful
debugging trick. Observability for your observability.
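Since the files are plain text, reading them yourself takes a few lines. Here is a minimal Python sketch that parses the two files shown above. The function names parse_loadavg and parse_ctxt are made up for this post, not part of any library, and the field meanings follow the proc(5) man page.

```python
from pathlib import Path


def parse_loadavg(text):
    """Split a /proc/loadavg line into its documented fields."""
    one, five, fifteen, tasks, last_pid = text.split()
    runnable, total = tasks.split("/")
    return {
        "load_1m": float(one),
        "load_5m": float(five),
        "load_15m": float(fifteen),
        "runnable": int(runnable),
        "total_tasks": int(total),
        "last_pid": int(last_pid),
    }


def parse_ctxt(stat_text):
    """Return the context-switch total from the contents of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("ctxt "):
            return int(line.split()[1])
    raise ValueError("no ctxt line found")


if __name__ == "__main__":
    loadavg = Path("/proc/loadavg")
    stat = Path("/proc/stat")
    if loadavg.exists() and stat.exists():  # only present on Linux
        print(parse_loadavg(loadavg.read_text()))
        print("context switches since boot:", parse_ctxt(stat.read_text()))
```

The ctxt value is a running total since boot, so a rate means reading it twice and dividing the difference by the interval, which is roughly what vmstat does for its cs column.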
Counters answer how much, and the 60-second checklist from last post is
counters end to end. What they can’t tell you is anything about individual
events. /proc/diskstats will say the disk did 12,000 reads; it cannot say
which process issued the slow ones, or that 200 of them took 50 ms while the
average looked fine. Averages are where latency outliers go to hide. For the
per-event story you need instrumentation, and the kernel offers it in two
flavors.
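To put numbers on the hiding, here is a small worked example in Python. The 12,000 reads and the 200 slow ones at 50 ms come from the paragraph above; the 0.2 ms latency for the other 11,800 reads is an assumption picked only for illustration.

```python
# Hypothetical workload: 12,000 reads, 200 of which took 50 ms.
# The 0.2 ms figure for the fast reads is an assumption for illustration.
latencies_ms = [0.2] * 11_800 + [50.0] * 200

mean_ms = sum(latencies_ms) / len(latencies_ms)

# A simple 99th-percentile pick: the value 99% of the way through the sorted list.
ordered = sorted(latencies_ms)
p99_ms = ordered[(99 * len(ordered)) // 100]

slow_share = 200 / len(latencies_ms)

print(f"mean = {mean_ms:.2f} ms, p99 = {p99_ms:.1f} ms, slow reads = {slow_share:.1%}")
```

Under those assumptions the mean lands near 1 ms, which looks healthy, while the 99th percentile is the full 50 ms. A counter can only ever give you the first number.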
Tracepoints: Instrumentation Someone Promised to Maintain
Tracepoints are static instrumentation: hooks that kernel developers
placed by hand at logical points such as system call entry and exit,
scheduler events, block I/O issue and completion, and page allocation.
They’re named subsystem:eventname, so disk I/O completion is
block:block_rq_complete and a process calling execve fires
sched:sched_process_exec. When no tracer is attached, a tracepoint is a
few NOP instructions sitting in the instruction stream, close enough to
free that they ship enabled in every production kernel.
The word that matters is stable. A tracepoint is an API. Its name and arguments are maintained across kernel versions, which means a tool built on tracepoints keeps working when you upgrade. The price of that promise is scarcity: somebody has to write and maintain each one, so you get on the order of a thousand tracepoints, placed where the maintainers thought visibility mattered, which is not necessarily where your problem is.
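If you want to see what your own kernel offers, Ftrace lists the events it knows about in a tracefs file, one subsystem:eventname per line. The sketch below counts them per subsystem. It assumes tracefs is mounted at /sys/kernel/tracing (older systems expose it under /sys/kernel/debug/tracing), the file usually needs root to read, and count_by_subsystem is a name invented here.

```python
from collections import Counter
from pathlib import Path

# Assumed tracefs mount point; older systems use /sys/kernel/debug/tracing.
AVAILABLE_EVENTS = Path("/sys/kernel/tracing/available_events")


def count_by_subsystem(text):
    """Count events per subsystem from 'subsystem:eventname' lines."""
    counts = Counter()
    for line in text.splitlines():
        subsystem, sep, _event = line.strip().partition(":")
        if sep:  # skip blank or malformed lines
            counts[subsystem] += 1
    return counts


if __name__ == "__main__":
    try:
        text = AVAILABLE_EVENTS.read_text()  # usually needs root
    except OSError as err:
        print(f"cannot read {AVAILABLE_EVENTS}: {err}")
    else:
        counts = count_by_subsystem(text)
        print(sum(counts.values()), "events")
        for subsystem, n in counts.most_common(10):
            print(f"{n:5d}  {subsystem}")
```

The exact total varies by kernel version and configuration, so treat the count as a property of the machine you ran it on.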
kprobes and uprobes: Instrumenting Code That Never Asked for It
This is the part that still feels slightly illegal the first time you see it work. A kprobe can instrument nearly any of the ~50,000 functions in a running kernel. No recompile, no reboot, no module. You name a function; the kernel makes it traceable while the system runs.
The mechanism is worth knowing because it explains both the power and the
cost. When you attach a kprobe, the kernel copies the instruction bytes at
the target address and replaces them with a breakpoint, int3 on x86_64.
When execution hits it, the breakpoint handler recognizes the kprobe,
runs your handler, then executes the saved original instructions and
resumes as if nothing happened. When the probe is removed, the original
bytes go back. Live binary patching of the kernel, as a service. (For
function entries there’s a faster path that piggybacks on Ftrace’s compiled
__fentry__ hook, and return probes, called kretprobes, work by hijacking
the return address through a trampoline so you can measure function
duration.)
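The save, patch, run, restore cycle is easier to hold in your head as a toy model. The Python below plays it out on a bytearray standing in for kernel text. It is a simulation under loud assumptions, not kernel code: ToyProbe and its methods are invented for this post, it patches a single byte, and it skips everything hard (instruction boundaries, out-of-line execution of the saved instruction, other CPUs).

```python
INT3 = 0xCC  # the one-byte x86 breakpoint opcode


class ToyProbe:
    """Toy model of breakpoint patching on a bytearray; not kernel code."""

    def __init__(self, text, addr, handler):
        self.text = text
        self.addr = addr
        self.handler = handler
        self.saved = None

    def attach(self):
        self.saved = self.text[self.addr]  # keep the original byte
        self.text[self.addr] = INT3        # patch in the breakpoint

    def hit(self):
        """What the breakpoint handler does when execution reaches addr."""
        self.handler(self.addr)            # run the user's handler first
        return self.saved                  # then the original instruction runs

    def detach(self):
        self.text[self.addr] = self.saved  # the original byte goes back
        self.saved = None


hits = []
code = bytearray([0x55, 0x48, 0x89, 0xE5, 0xC3])  # push rbp; mov rbp, rsp; ret
original = bytes(code)

probe = ToyProbe(code, 0, hits.append)
probe.attach()
patched = bytes(code)        # first byte is now 0xCC
resumed_with = probe.hit()   # handler ran, original byte handed back
probe.detach()

print(patched.hex(), hex(resumed_with), bytes(code) == original, hits)
```

The point of the model is the cost structure: an unprobed function is byte-for-byte untouched, and a probed one pays for a breakpoint exception on every hit.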
uprobes are the same idea pointed at user space: instrument any function
or instruction in any binary or library on the system, with the breakpoint
trick performed on the process image instead. Want to see every
SSL_read in your TLS library, or every malloc your application makes?
That’s a uprobe, no application change required.
The catch is the inverse of the tracepoint promise: kernel functions are not an API. The function you traced today can be renamed, inlined away, or refactored out of existence in the next kernel release, and your tooling silently breaks. Hence the standing rule, straight from the book: use tracepoints first if they exist and suffice; reach for kprobes as a backup. (User space has a tracepoint equivalent too. USDT probes are statically compiled markers in applications and runtimes like Java, PostgreSQL, and libc, with the same stable-but-sparse trade-off.)
The disabled cost differs too: kprobes cost literally nothing when not in use (the code is unmodified), while tracepoints carry their few NOPs forever. Nobody has ever noticed the NOPs.
PMCs: What the Silicon Saw
Everything above observes software. Performance monitoring counters live in the processor itself and are the only way to answer questions like: how many cycles did this code burn, how often did it miss the L3 cache, how badly is it stalled on memory? On the human-scaled latency table from last post, a DRAM access was six minutes to a cycle’s one second. PMCs are how you find out you’re paying that six minutes over and over.
There are hundreds of countable events but only a handful of hardware
counter registers, roughly six per core you can program at once, so you
choose your events or time-share them. PMCs run in two modes: counting,
where they just tally (this is what perf stat shows, at near-zero cost),
and overflow sampling, where every Nth event fires an interrupt that grabs
an instruction pointer or stack. Sampling is how you find the code with the
cache misses, not just the count of them.
$ perf stat -a sleep 10
...
45,476,612,925 cycles
34,227,054,303 instructions # 0.75 insn per cycle
Instructions per cycle is the single most useful PMC-derived number: it tells you whether your “100% CPU” is a CPU actually executing (IPC heading toward 1 and beyond on modern hardware) or a CPU quietly waiting on memory while pretending to be busy (IPC well under that). Two workloads at identical utilization can differ several-fold in real work done. More on this in the CPUs post.
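The arithmetic behind that number is a single division, sketched here in Python against the two lines of perf stat output shown above. parse_perf_counts is a made-up helper that only handles this simple "count, then event name" shape; real perf stat output varies with version, locale, and options.

```python
def parse_perf_counts(output):
    """Map event name -> count from 'N,NNN,NNN event' lines of perf stat output."""
    counts = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].replace(",", "").isdigit():
            counts[fields[1]] = int(fields[0].replace(",", ""))
    return counts


sample = """
45,476,612,925 cycles
34,227,054,303 instructions # 0.75 insn per cycle
"""

counts = parse_perf_counts(sample)
ipc = counts["instructions"] / counts["cycles"]
print(f"IPC = {ipc:.2f}")  # prints "IPC = 0.75"
```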
Stack Traces, or: Turn Frame Pointers Back On
Almost every interesting question is answered by a stack trace taken at the event: who issued the slow I/O, what code path allocated the memory, where was the program when it blocked. The cheap, reliable way to walk a stack is the frame pointer convention, where each function keeps the address of the previous frame in a register (RBP on x86_64), forming a linked list a tracer can follow in a few instructions.
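That linked list can be sketched as a toy model in Python. The dictionary stands in for stack memory, and every address in it is made up. The layout assumption is the usual x86_64 one: the slot RBP points at holds the caller's saved RBP, and the slot 8 bytes above it holds the return address.

```python
def walk_stack(memory, rbp, max_frames=64):
    """Follow saved frame pointers and collect the return addresses.

    Toy layout assumption: memory[rbp] holds the caller's saved RBP and
    memory[rbp + 8] holds the return address.
    """
    return_addresses = []
    while rbp != 0 and len(return_addresses) < max_frames:
        return_addresses.append(memory[rbp + 8])
        rbp = memory[rbp]  # hop to the caller's frame
    return return_addresses


# Three nested frames with made-up addresses: main -> handle -> read_disk
memory = {
    0x7000: 0x7040, 0x7008: 0x401234,  # read_disk frame, returns into handle
    0x7040: 0x7080, 0x7048: 0x401100,  # handle frame, returns into main
    0x7080: 0x0,    0x7088: 0x401000,  # main frame, saved RBP of 0 ends the chain
}

stack = walk_stack(memory, 0x7000)
print([hex(addr) for addr in stack])
```

Each hop is two memory reads, which is why this walk is cheap enough to do on every sample. It also shows the failure mode: if one function in the chain used RBP for something else, the walk follows garbage from that frame on.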
The bad news: compilers treat RBP as a free register and omit frame
pointers by default once optimization is on, which truncates your stacks
and turns your profiles into a single mystery frame. The fix is
-fno-omit-frame-pointer where you compile, and
runtime-appropriate flags where you don’t (e.g.,
-XX:+PreserveFramePointer for the JVM). The cost is small; flying blind
costs more. Distributions are slowly coming around. Fedora now builds its
packages with frame pointers, and other distros are having the same
argument on their mailing lists. Check your own binaries before you need
them, not during the incident.
The Tracers: perf, Ftrace, BPF
So who consumes these sources? Three front-ends matter on modern Linux, and all three speak tracepoints, kprobes, uprobes, and PMCs:
perf is the official Linux profiler: a record-and-report design that samples stacks, counts PMCs, and dumps per-event data for offline analysis.
Ftrace lives in the kernel itself, controlled through tracefs files. It has no userspace dependency at all, which makes it the tool that always works, including on the stripped-down box that has nothing installed.
BPF is the newest and the reason tracing got interesting again: small verified programs attached directly to events, running in the kernel, computing answers (histograms, per-process latency maps, filtered event streams) and handing user space the summary instead of a firehose of raw events. That in-kernel aggregation is the whole trick. It’s the difference between shipping every block I/O event to a process for post-processing and shipping one latency histogram per second.
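To make the aggregation idea concrete, here is a plain Python sketch of folding a stream of per-event latencies into power-of-two buckets, in the spirit of the histograms BPF tools print. Nothing here runs in the kernel or uses a BPF API; log2_bucket and aggregate are invented names and the latencies are made up.

```python
from collections import Counter


def log2_bucket(value):
    """Lower bound of the power-of-two bucket holding value (0 stays 0)."""
    return 0 if value <= 0 else 1 << (value.bit_length() - 1)


def aggregate(latencies_us):
    """Fold a stream of per-event latencies into bucket counts."""
    histogram = Counter()
    for latency in latencies_us:
        histogram[log2_bucket(latency)] += 1
    return histogram


events = [1, 2, 3, 4, 7, 8, 100]  # made-up per-I/O latencies in microseconds
histogram = aggregate(events)

for low in sorted(histogram):
    high = max(low * 2, 1)
    print(f"[{low}, {high}) {'@' * histogram[low]}")
```

However many events arrive, the summary stays a handful of integers. That is the part that moves into the kernel with BPF, so only the buckets cross into user space.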
You’ll mostly touch BPF through two front-ends: BCC, a library with ~100
ready-made tools (biolatency, execsnoop, tcplife, the cast of the
rest of this series), and bpftrace, an awk-flavored language for one-liners
when no canned tool fits:
# count syscalls by process name, system-wide, in-kernel
$ bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'
One line, attached to a stable tracepoint, aggregated in a kernel map, negligible overhead. Ten years ago that was a kernel module and a bad week.
Why This Matters
The stack is: counters for how much, tracepoints and probes for what
happened, PMCs for what the silicon did, stack traces for who did it,
and three tracers that multiplex it all. Every tool in the rest of this
series is a thin layer over exactly these pieces. When biolatency prints
a histogram, you now know it’s a kprobe or tracepoint on the block layer
feeding a BPF map. Which also means that when no tool exists for your
question, you know what to build it from.
Next post we put all of it to work on the resource everyone blames first: CPUs. What utilization actually measures (and hides), run queues, IPC, and profiling with flame graphs.
References
- Gregg, B. (2020). Systems Performance: Enterprise and the Cloud, 2nd Edition. Addison-Wesley. Chapters 4, 13–15.
- Gregg, B. (2019). BPF Performance Tools. Addison-Wesley. Chapter 2.
- Desnoyers, M. (2009). Tracepoint infrastructure, Linux 2.6.32. kernel.org documentation: Using the Linux Kernel Tracepoints.