Better Flashlights: How Linux Observability Works Under the Tools

Last post ended with a claim: learn the questions and every new tool is just a better flashlight. This one is about the flashlights. Not the tools themselves yet, but the machinery underneath them, because there is less of it than the tool landscape suggests. top, perf, Ftrace, bpftrace, every dashboard your monitoring vendor sells you: all of them sit on the same small set of kernel event sources. Once you know what those sources are and what each one costs, tools stop being magic and start being interchangeable.¹²

A useful way to carve up the space: every observability tool is answering one of three questions. How much? That’s counters. Where is the time going? That’s profiling. What exactly happened, and how long did it take? That’s tracing. The sources below map onto those questions.

Counters: The Statistics You Already Paid For

The kernel keeps counters for everything it does (packets, interrupts, context switches, page faults) and it keeps them whether or not anyone is looking. They’re maintained unconditionally, so reading them adds essentially nothing to system load. Gregg calls them “free.” When vmstat or iostat or top prints a number, it almost always came from one of two pseudo-filesystems:

$ cat /proc/loadavg
0.52 0.58 0.59 2/1029 318483
$ grep ctxt /proc/stat
ctxt 6571418641

/proc holds per-process directories (/proc/PID/stat, /proc/PID/smaps) plus system-wide files like /proc/meminfo and /proc/diskstats. /sys grew out of it for device and kernel state, and netlink sockets serve the network side; ss and ip speak netlink, a binary protocol, rather than parsing text. /proc and /sys are plain text on purpose: when a tool lies to you, or you suspect it might be, you can cat the file it read and check. Tracing a tool to see which /proc files it opens (strace -e openat top) is a genuinely useful debugging trick. Observability for your observability.
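Because the files are plain text, checking a tool's arithmetic from a script takes a few lines. A minimal sketch, fed with the sample output shown above; the function names are mine, and the field meanings assumed for /proc/loadavg are the usual ones (three load averages, runnable/total tasks, most recent PID):

```python
def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into named fields."""
    one, five, fifteen, tasks, last_pid = text.split()
    runnable, total = tasks.split("/")
    return {
        "load1": float(one),
        "load5": float(five),
        "load15": float(fifteen),
        "runnable": int(runnable),
        "total_tasks": int(total),
        "last_pid": int(last_pid),
    }


def context_switches(stat_text):
    """Return the cumulative context-switch count from /proc/stat text."""
    for line in stat_text.splitlines():
        if line.startswith("ctxt "):
            return int(line.split()[1])
    raise ValueError("no ctxt line found")


# Sample text copied from the output above. On a Linux box, pass
# open("/proc/loadavg").read() and open("/proc/stat").read() instead.
load = parse_loadavg("0.52 0.58 0.59 2/1029 318483\n")
switches = context_switches("ctxt 6571418641\n")
print(load["load1"], load["total_tasks"], switches)
```

Note that ctxt is cumulative since boot, so a rate needs two reads and a subtraction.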

Counters answer how much, and the 60-second checklist from last post is counters end to end. What they can’t tell you is anything about individual events. /proc/diskstats will say the disk did 12,000 reads; it cannot say which process issued the slow ones, or that 200 of them took 50 ms while the average looked fine. Averages are where latency outliers go to hide. For the per-event story you need instrumentation, and the kernel offers it in two flavors.
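The arithmetic behind that hiding act is worth seeing once. A sketch with invented numbers: 12,000 reads, 200 of them at 50 ms, and an assumed 0.2 ms for each of the rest (that last figure is mine, picked only for illustration):

```python
def mean(values):
    return sum(values) / len(values)


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = -(-len(ordered) * pct // 100)  # ceiling division, integers only
    return ordered[max(rank, 1) - 1]


# Hypothetical per-read latencies in milliseconds.
latencies_ms = [0.2] * 11_800 + [50.0] * 200

print(f"mean = {mean(latencies_ms):.2f} ms")
print(f"p50  = {percentile(latencies_ms, 50)} ms")
print(f"p99  = {percentile(latencies_ms, 99)} ms")
```

The mean lands around 1 ms and the median at 0.2 ms, both of which look healthy, while the 99th percentile is 50 ms. A counter can only ever give you the first of those three numbers.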

Tracepoints: Instrumentation Someone Promised to Maintain

Tracepoints are static instrumentation: hooks that kernel developers placed by hand at logical points such as system call entry and exit, scheduler events, block I/O issue and completion, and page allocation. They’re named subsystem:eventname, so disk I/O completion is block:block_rq_complete and a process calling execve fires sched:sched_process_exec. When no tracer is attached, a tracepoint is a few NOP instructions sitting in the instruction stream, close enough to free that they ship enabled in every production kernel.

The word that matters is stable. A tracepoint is an API.³ Its name and arguments are maintained across kernel versions, which means a tool built on tracepoints keeps working when you upgrade. The price of that promise is scarcity: somebody has to write and maintain each one, so you get on the order of a thousand tracepoints, placed where the maintainers thought visibility mattered, which is not necessarily where your problem is.
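To see where the maintainers did put visibility on a given kernel, tracefs lists the tracepoints one per line in the same subsystem:eventname form. A sketch that tallies them per subsystem; it assumes tracefs is mounted at /sys/kernel/tracing and that you can read available_events there (usually root only), and the helper name is mine:

```python
from collections import Counter


def count_by_subsystem(available_events_text):
    """Tally tracepoints per subsystem from 'subsystem:eventname' lines."""
    counts = Counter()
    for line in available_events_text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blanks and anything not in the expected form
        subsystem, _event = line.split(":", 1)
        counts[subsystem] += 1
    return counts


# Small sample for illustration. On a real system, read the text from
# /sys/kernel/tracing/available_events (assumed mount point) instead.
sample = """\
block:block_rq_complete
block:block_rq_issue
sched:sched_process_exec
sched:sched_switch
raw_syscalls:sys_enter
"""

for subsystem, n in count_by_subsystem(sample).most_common():
    print(f"{subsystem:15s} {n}")
```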

kprobes and uprobes: Instrumenting Code That Never Asked for It

This is the part that still feels slightly illegal the first time you see it work. A kprobe can instrument nearly any of the ~50,000 functions in a running kernel. No recompile, no reboot, no module. You name a function; the kernel makes it traceable while the system runs.

The mechanism is worth knowing because it explains both the power and the cost. When you attach a kprobe, the kernel copies the instruction bytes at the target address and replaces them with a breakpoint, int3 on x86_64. When execution hits it, the breakpoint handler recognizes the kprobe, runs your handler, then executes the saved original instructions and resumes as if nothing happened. When the probe is removed, the original bytes go back. Live binary patching of the kernel, as a service. (For function entries there’s a faster path that piggybacks on Ftrace’s compiled __fentry__ hook, and return probes, called kretprobes, work by hijacking the return address through a trampoline so you can measure function duration.)
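The save, patch, run, restore sequence is easier to hold in your head as a toy. The sketch below is a model over a Python bytearray, not kernel code: the class and method names are invented, it handles only single-byte instructions, and the only real-world detail it borrows is that int3 is the one-byte opcode 0xCC.

```python
INT3 = 0xCC  # x86 one-byte breakpoint opcode


class ToyKprobes:
    """Toy model of breakpoint probing over a bytearray 'text segment'."""

    def __init__(self, text):
        self.text = bytearray(text)
        self.saved = {}     # address -> original byte
        self.handlers = {}  # address -> probe handler

    def register(self, addr, handler):
        self.saved[addr] = self.text[addr]  # copy the original byte
        self.text[addr] = INT3              # patch in the breakpoint
        self.handlers[addr] = handler

    def unregister(self, addr):
        self.text[addr] = self.saved.pop(addr)  # original byte goes back
        del self.handlers[addr]

    def execute(self, addr):
        """'Execute' one byte and return the opcode that effectively ran."""
        opcode = self.text[addr]
        if opcode == INT3 and addr in self.handlers:
            self.handlers[addr](addr)  # run the probe handler first
            opcode = self.saved[addr]  # then the saved original instruction
        return opcode


hits = []
cpu = ToyKprobes(bytes([0x55, 0x48, 0x89, 0xE5]))  # push %rbp; mov %rsp,%rbp
cpu.register(0, hits.append)
patched = cpu.text[0]  # the byte in 'memory' while the probe is armed
ran = cpu.execute(0)   # handler fires, original instruction still runs
cpu.unregister(0)
print(hex(patched), hex(ran), hits, hex(cpu.text[0]))
```

The cost model falls out of the structure: an armed probe adds a trap plus your handler to every hit, and an unarmed one leaves the bytes exactly as they were.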

uprobes are the same idea pointed at user space: instrument any function or instruction in any binary or library on the system, with the breakpoint trick performed on the process image instead. Want to see every SSL_read in your TLS library, or every malloc your application makes? That’s a uprobe, no application change required.

The catch is the inverse of the tracepoint promise: kernel functions are not an API. The function you traced today can be renamed, inlined away, or refactored out of existence in the next kernel release, and your tooling silently breaks. Hence the standing rule, straight from the book: use tracepoints first if they exist and suffice; reach for kprobes as a backup. (User space has a tracepoint equivalent too. USDT probes are statically compiled markers in applications and runtimes like Java, PostgreSQL, and libc, with the same stable-but-sparse trade-off.)

The disabled cost differs too: kprobes cost literally nothing when not in use (the code is unmodified), while tracepoints carry their few NOPs forever. Nobody has ever noticed the NOPs.

PMCs: What the Silicon Saw

Everything above observes software. Performance monitoring counters live in the processor itself and are the only way to answer questions like: how many cycles did this code burn, how often did it miss the L3 cache, how badly is it stalled on memory? On the human-scaled latency table from last post, a DRAM access was six minutes to a cycle’s one second. PMCs are how you find out you’re paying that six minutes over and over.

There are hundreds of countable events but only a handful of hardware counter registers, roughly six per core you can program at once, so you choose your events or time-share them. PMCs run in two modes: counting, where they just tally (this is what perf stat shows, at near-zero cost), and overflow sampling, where every Nth event fires an interrupt that grabs an instruction pointer or stack. Sampling is how you find the code with the cache misses, not just the count of them.
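Time-sharing has a consequence for the numbers you read: an event that was only on a counter for part of the interval has to be scaled up to an estimate. A sketch of that scaling, using the enabled-time and running-time values the kernel reports alongside a count; the function name and the figures are hypothetical:

```python
def scale_multiplexed_count(raw_count, time_enabled_ns, time_running_ns):
    """Estimate a full-interval count for an event that was on a hardware
    counter for only part of the time it was enabled."""
    if time_running_ns == 0:
        return None  # never scheduled onto a counter, so no estimate
    return round(raw_count * time_enabled_ns / time_running_ns)


# Hypothetical: enabled for 10 s, actually counting for 2.5 s of that.
estimate = scale_multiplexed_count(1_000_000, 10_000_000_000, 2_500_000_000)
print(estimate)
```

The estimate assumes the event rate was uniform across the interval, which is exactly what a bursty workload violates. If a number matters, ask for few enough events that nothing has to be rotated.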

$ perf stat -a sleep 10
...
     45,476,612,925      cycles
     34,227,054,303      instructions   #  0.75 insn per cycle

Instructions per cycle is the single most useful PMC-derived number: it tells you whether your “100% CPU” is a CPU actually executing (IPC heading toward 1 and beyond on modern hardware) or a CPU quietly waiting on memory while pretending to be busy (IPC well under that). Two workloads at identical utilization can differ several-fold in real work done. More on this in the CPUs post.
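The number is nothing more than a division, which makes it easy to recompute from saved perf stat output. A sketch that parses the two lines shown above; the parser is deliberately naive and the function names are mine:

```python
def parse_perf_stat(text):
    """Extract {event_name: count} from lines like '  1,234  cycles'."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            number = fields[0].replace(",", "")
            if number.isdigit():
                counts[fields[1]] = int(number)
    return counts


def ipc(counts):
    return counts["instructions"] / counts["cycles"]


sample = """\
...
     45,476,612,925      cycles
     34,227,054,303      instructions   #  0.75 insn per cycle
"""

counts = parse_perf_stat(sample)
print(f"{ipc(counts):.2f} insn per cycle")
```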

Stack Traces, or: Turn Frame Pointers Back On

Almost every interesting question is answered by a stack trace taken at the event: who issued the slow I/O, what code path allocated the memory, where was the program when it blocked. The cheap, reliable way to walk a stack is the frame pointer convention: each function saves its caller’s frame pointer on the stack and points a register (RBP on x86_64) at that saved value, forming a linked list a tracer can follow in a few instructions.
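Walking that linked list is short enough to sketch. The model below uses a dict as memory and assumes the usual x86_64 frame layout: the word at [rbp] is the caller's saved RBP and the word at [rbp + 8] is the return address. The addresses are invented, and ending the chain at zero is a simplifying assumption.

```python
def walk_stack(memory, rbp, max_frames=64):
    """Follow the saved-RBP chain and collect return addresses.

    memory maps an address to the 8-byte value stored there.
    """
    return_addresses = []
    while rbp != 0 and len(return_addresses) < max_frames:
        return_addresses.append(memory[rbp + 8])  # where this frame returns to
        rbp = memory[rbp]                         # hop to the caller's frame
    return return_addresses


# Hypothetical stack: leaf() called by middle() called by main().
memory = {
    0x7000: 0x7020, 0x7008: 0x401136,  # leaf's frame, returns into middle
    0x7020: 0x7040, 0x7028: 0x401180,  # middle's frame, returns into main
    0x7040: 0x0,    0x7048: 0x401020,  # main's frame, chain ends at 0
}

frames = walk_stack(memory, rbp=0x7000)
print([hex(addr) for addr in frames])
```

Two memory reads per frame is why this is cheap enough to do on every sample. Without frame pointers, the value in RBP is whatever the compiler happened to put there, and the walk goes nowhere useful.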

The bad news: compilers treat RBP as a free register and omit frame pointers by default in optimized builds, which leaves your profiles with truncated stacks, often a single mystery frame. The fix is -fno-omit-frame-pointer where you compile, and runtime-appropriate flags where you don’t (e.g., -XX:+PreserveFramePointer for the JVM). The cost is small; flying blind costs more. Distributions are slowly coming around. Fedora now builds its packages with frame pointers, and other distros are having the same argument on their mailing lists. Check your own binaries before you need them, not during the incident.

The Tracers: perf, Ftrace, BPF

So who consumes these sources? Three tracers matter on modern Linux. All three speak tracepoints, kprobes, and uprobes, and perf and BPF read PMCs as well:

perf is the official Linux profiler: a record-and-report design that samples stacks, counts PMCs, and dumps per-event data for offline analysis.

Ftrace lives in the kernel itself, controlled through tracefs files. It has no userspace dependency at all, which makes it the tool that always works, including on the stripped-down box that has nothing installed.

BPF is the newest and the reason tracing got interesting again: small verified programs attached directly to events, running in the kernel, computing answers (histograms, per-process latency maps, filtered event streams) and handing user space the summary instead of a firehose of raw events. That in-kernel aggregation is the whole trick. It’s the difference between shipping every block I/O event to a process for post-processing and shipping one latency histogram per second.
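What that aggregation looks like can be sketched in ordinary Python. This is a user-space stand-in for the power-of-two histogram that BPF tools keep in a kernel map, not BPF itself; the class name and the sample latencies are made up.

```python
from collections import Counter


def log2_bucket(value):
    """Index i of the power-of-two bucket [2**i, 2**(i+1)) holding value.

    Values below 1 are lumped into bucket 0.
    """
    return max(int(value), 1).bit_length() - 1


class LatencyHistogram:
    """Keeps one counter per bucket instead of one record per event."""

    def __init__(self):
        self.buckets = Counter()

    def record(self, latency_us):
        self.buckets[log2_bucket(latency_us)] += 1

    def summary(self):
        """Return {(low, high): count} for non-empty buckets, in order."""
        return {(2 ** i, 2 ** (i + 1)): n
                for i, n in sorted(self.buckets.items())}


hist = LatencyHistogram()
for latency_us in [200] * 9 + [50_000]:  # hypothetical block I/O latencies
    hist.record(latency_us)

for (low, high), count in hist.summary().items():
    print(f"[{low}, {high}) us: {count}")
```

Ten events went in and two lines came out, and the slow outlier still has its own bucket. Scale the input to a million events per second and the output is the same handful of lines.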

You’ll mostly touch BPF through two front-ends: BCC, a library with ~100 ready-made tools (biolatency, execsnoop, tcplife, the cast of the rest of this series), and bpftrace, an awk-flavored language for one-liners when no canned tool fits:

# count syscalls by process name, system-wide, in-kernel
$ bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

One line, attached to a stable tracepoint, aggregated in a kernel map, with overhead that stays small unless the syscall rate is very high. Ten years ago that was a kernel module and a bad week.

Why This Matters

The stack is: counters for how much, tracepoints and probes for what happened, PMCs for what the silicon did, stack traces for who did it, and three tracers that multiplex it all. Every tool in the rest of this series is a thin layer over exactly these pieces. When biolatency prints a histogram, you now know it’s a kprobe or tracepoint on the block layer feeding a BPF map. Which also means that when no tool exists for your question, you know what to build it from.

Next post we put all of it to work on the resource everyone blames first: CPUs. What utilization actually measures (and hides), run queues, IPC, and profiling with flame graphs.

References

1. Gregg, B. (2020). Systems Performance: Enterprise and the Cloud, 2nd Edition. Addison-Wesley. Chapters 4, 13–15.
2. Gregg, B. (2019). BPF Performance Tools. Addison-Wesley. Chapter 2.
3. Desnoyers, M. (2009). Tracepoint infrastructure, Linux 2.6.32. kernel.org documentation: Using the Linux Kernel Tracepoints.