The Disk Didn't Do It: File System Latency Is What Your App Feels
When storage gets blamed for slowness, everyone’s first move is iostat,
which is disk metrics. Reasonable instinct, wrong layer. Applications don’t
perform disk I/O; they perform file system I/O. Between a read() and any
platter or flash chip sit the VFS, the page cache, read-ahead, write-back,
file system journaling, and a volume manager, each of which can absorb the
I/O entirely, multiply it, defer it for thirty seconds, or hand the
application a latency that has nothing to do with how the disk is feeling.
The latency your application experiences is file system latency, and
that’s the layer to measure first.
Part five of the series. The page cache did a cameo in the memory post; here it’s a main character.
What Sits Between read() and the Disk
Every file operation enters through the VFS, the kernel’s abstraction that
makes ext4, XFS, ZFS, NFS, and tmpfs all answer the same interface. Below
it, the file system consults a stack of caches: the page cache for file
contents, the dentry and inode caches for the name lookups and metadata
that make open() cheap. A read that hits page cache returns in
microseconds, DRAM speed, the months-to-minutes conversion from the
memory post. A read that misses goes to the device and costs whatever the
device costs.
Writes are stranger. By default they’re write-back: the application’s
write() copies into the page cache, the page is marked dirty, and the
call returns at memory speed. The actual disk write happens later,
batched and coalesced, when the flusher threads get around to it. Tens of
seconds later is normal. The application experiences a fast write that
hasn’t happened yet; durability arrives asynchronously, unless it asks for
fsync() and pays the real price. (This is also why write latency
“to disk” measured at the application is often fiction in both directions:
suspiciously fast when buffered, then horrifying when an fsync flushes
a backlog.)
And reads are speculative: detect a sequential pattern and the file system prefetches ahead of the application, turning what would be a string of cache misses into hits. When read-ahead guesses right, it’s free performance. When it guesses wrong, on random access that looks briefly sequential, it’s the disk doing work nobody asked for.
Logical vs. Physical I/O, or Why iostat Can’t Convict Anyone
Add this up and you get the core idea of the chapter, the one worth keeping: the I/O your application issues (logical) and the I/O the disk sees (physical) are different events, related by bookkeeping you can’t see from either end alone. Gregg’s taxonomy of the mismatch:
Deflated: application I/O that shrinks or vanishes before the disk. Cache hits, writes to the same offset cancelled in cache, compression, tmpfs that never touches storage at all.
Inflated: disk I/O bigger than what the application asked for. Metadata to record where everything lives, record-size rounding, RAID parity, and journaling, which can double writes outright, once to the journal and once to the destination.
Indirect, implicit, unrelated: I/O shifted in time (write-back bursts, prefetch), I/O from mmap’d files where a load instruction triggers a disk read with no syscall anywhere in sight, and I/O that simply isn’t yours. Backups, RAID rebuilds, the other tenant on your cloud hypervisor.
Now look at iostat again: the disk is 80% busy with 400 writes/sec.
Whose? The application wrote nothing in the last second. Is the burst
write-back from a minute ago, journal traffic, a backup? The disk metrics
can’t say. This is why descending to the disk layer first is an
interrogation of the wrong witness. Measure where the application actually
waits, then descend with a question instead of a suspicion.
Measuring What the Application Feels
The direct measurement is per-operation file system latency, and the BCC
tools make it almost embarrassingly easy. ext4slower (and its siblings
xfsslower, zfsslower, nfsslower) traces common operations including
reads, writes, opens, and fsyncs, and prints only those slower than a
threshold, 10 ms by default:
# xfsslower
Tracing XFS operations slower than 10 ms
TIME COMM PID T BYTES OFF_KB LAT(ms) FILENAME
02:04:07 java 5565 R 63559 360237 17.16 shuffle_2_63762_0.data
Read that line: a specific process, a specific operation, a specific file,
17 milliseconds. This is evidence of the kind iostat structurally cannot
produce, because it ties latency to a victim. Run it during the complaint
window and you either get lines (the file system layer is implicated, and
you know exactly where) or silence (it isn’t, and you just exonerated the
entire storage stack in one command; go look at locks or the network).
That binary outcome makes it one of the highest-value first moves in any
“storage is slow” investigation.
For the distribution rather than the outliers, ext4dist gives latency
histograms per operation type. As with runqlat in the CPU post, look for
the second mode: cache hits cluster in microseconds, disk-backed
operations in milliseconds, and the ratio between the modes is your
cache behavior, drawn as a picture.
For workload characterization, the “what is actually being asked of this
file system” question that the methodology post says to ask first,
opensnoop shows every file being opened (chronically underrated;
applications open astonishing things at astonishing rates), and filetop
is a top for file reads and writes, which finds the hot files by name.
Watching the Cache Work
The page cache’s effectiveness is one number, hit ratio, and cachestat
prints it live. Here’s a controlled experiment from the book worth
replaying in your head: create a 1 GB file, flush and drop caches, then
read the file twice:
# cachestat -T 10
TIME HITS MISSES DIRTIES HITRATIO CACHED_MB
21:08:58 771 0 1 100.00% 190
21:09:08 33036 53975 16 37.97% 400
21:09:18 15 68544 2 0.02% 668
21:09:38 5 67424 0 0.01% 1187
21:10:08 268421 11 12 100.00% 1232
First read: misses pile up, CACHED_MB climbs by exactly the gigabyte as
the file pours into cache, hit ratio near zero. Second read at 21:10:08: a
quarter-million hits, nothing from disk, and the application just
experienced the same logical workload at a few hundred times the speed.
Same code, same file, same disk. The only variable is cache state.
That sensitivity is the operational point. A hit ratio that steps down and stays down means the working set outgrew the cache, or something else is competing for the memory: a deploy that grew the heap, another workload colocated onto the box. The symptom files as “disk got slow”; the cause lives in the memory post.
cachestathooks kernel internals that carry no stability promise. Its author describes it as a sand castle a kernel update can wash away. Tools built on kprobes (post two) age like this; the question they answer doesn’t.
A Working Method
Assembled into the order I actually use: characterize the workload first,
with filetop and opensnoop showing who’s touching what. That catches
the “surprise workload” class of problem (the backup, the log-spammer, the
accidental full-table scan) before any latency analysis. Then measure
operation latency with ext4slower and ext4dist against the complaint.
Got lines? The file system layer is implicated. Got silence? Look
elsewhere. Then check the cache with cachestat, because a hit-ratio
cliff explains slow operations without any disk involvement. Only after
all that, descend to the disks, which is exactly where the next post picks
up.
Why This Matters
The file system is the great launderer of I/O: it absorbs, defers, multiplies, and reorders everything passing through, which makes both layers of metrics liars in isolation. Disk metrics can’t name the victim, and application metrics can’t see the cause. Per-operation latency tracing at the file system boundary is the honest middle, and it turns “storage feels slow” from an opinion into a line of output with a process name, a file name, and a millisecond count on it.
Next post: down the stack at last. The block layer, what iostat’s
columns really mean, disk latency distributions, and when the disk
actually did do it.
References
- Gregg, B. (2020). Systems Performance: Enterprise and the Cloud, 2nd Edition. Addison-Wesley. Chapter 8.
- Gregg, B. (2019). BPF Performance Tools. Addison-Wesley. Chapter 8.