The Disk Didn't Do It: File System Latency Is What Your App Feels

When storage gets blamed for slowness, everyone’s first move is iostat, which is disk metrics. Reasonable instinct, wrong layer. Applications don’t perform disk I/O; they perform file system I/O. Between a read() and any platter or flash chip sit the VFS, the page cache, read-ahead, write-back, file system journaling, and a volume manager, each of which can absorb the I/O entirely, multiply it, defer it for thirty seconds, or hand the application a latency that has nothing to do with how the disk is feeling. The latency your application experiences is file system latency, and that’s the layer to measure first.¹²

Part five of the series. The page cache did a cameo in the memory post; here it’s a main character.

What Sits Between read() and the Disk

Every file operation enters through the VFS, the kernel’s abstraction that makes ext4, XFS, ZFS, NFS, and tmpfs all answer the same interface. Below it, the file system consults a stack of caches: the page cache for file contents, the dentry and inode caches for the name lookups and metadata that make open() cheap. A read that hits page cache returns in microseconds, DRAM speed, the months-to-minutes conversion from the memory post. A read that misses goes to the device and costs whatever the device costs.

Writes are stranger. By default they’re write-back: the application’s write() copies into the page cache, the page is marked dirty, and the call returns at memory speed. The actual disk write happens later, batched and coalesced, when the flusher threads get around to it. Tens of seconds later is normal. The application experiences a fast write that hasn’t happened yet; durability arrives asynchronously, unless it asks for fsync() and pays the real price. (This is also why write latency “to disk” measured at the application is often fiction in both directions: suspiciously fast when buffered, then horrifying when an fsync flushes a backlog.)

And reads are speculative: detect a sequential pattern and the file system prefetches ahead of the application, turning what would be a string of cache misses into hits. When read-ahead guesses right, it’s free performance. When it guesses wrong, on random access that looks briefly sequential, it’s the disk doing work nobody asked for.

Logical vs. Physical I/O, or Why iostat Can’t Convict Anyone

Add this up and you get the core idea of the chapter, the one worth keeping: the I/O your application issues (logical) and the I/O the disk sees (physical) are different events, related by bookkeeping you can’t see from either end alone. Gregg’s taxonomy of the mismatch:

Deflated: application I/O that shrinks or vanishes before the disk. Cache hits, writes to the same offset cancelled in cache, compression, tmpfs that never touches storage at all.

Inflated: disk I/O bigger than what the application asked for. Metadata to record where everything lives, record-size rounding, RAID parity, and journaling, which can double writes outright, once to the journal and once to the destination.

Indirect, implicit, unrelated: I/O shifted in time (write-back bursts, prefetch), I/O from mmap’d files where a load instruction triggers a disk read with no syscall anywhere in sight, and I/O that simply isn’t yours. Backups, RAID rebuilds, the other tenant on your cloud hypervisor.

Now look at iostat again: the disk is 80% busy with 400 writes/sec. Whose? The application wrote nothing in the last second. Is the burst write-back from a minute ago, journal traffic, a backup? The disk metrics can’t say. This is why descending to the disk layer first is an interrogation of the wrong witness. Measure where the application actually waits, then descend with a question instead of a suspicion.

Measuring What the Application Feels

The direct measurement is per-operation file system latency, and the BCC tools make it almost embarrassingly easy. ext4slower (and its siblings xfsslower, zfsslower, nfsslower) traces common operations including reads, writes, opens, and fsyncs, and prints only those slower than a threshold, 10 ms by default:

# xfsslower
Tracing XFS operations slower than 10 ms
TIME     COMM         PID    T BYTES   OFF_KB   LAT(ms) FILENAME
02:04:07 java         5565   R 63559   360237     17.16 shuffle_2_63762_0.data

Read that line: a specific process, a specific operation, a specific file, 17 milliseconds. This is evidence of the kind iostat structurally cannot produce, because it ties latency to a victim. Run it during the complaint window and you either get lines (the file system layer is implicated, and you know exactly where) or silence (it isn’t, and you just exonerated the entire storage stack in one command; go look at locks or the network). That binary outcome makes it one of the highest-value first moves in any “storage is slow” investigation.

For the distribution rather than the outliers, ext4dist gives latency histograms per operation type. As with runqlat in the CPU post, look for the second mode: cache hits cluster in microseconds, disk-backed operations in milliseconds, and the ratio between the modes is your cache behavior, drawn as a picture.

For workload characterization, the “what is actually being asked of this file system” question that the methodology post says to ask first, opensnoop shows every file being opened (chronically underrated; applications open astonishing things at astonishing rates), and filetop is a top for file reads and writes, which finds the hot files by name.

Watching the Cache Work

The page cache’s effectiveness is one number, hit ratio, and cachestat prints it live. Here’s a controlled experiment from the book worth replaying in your head: create a 1 GB file, flush and drop caches, then read the file twice:

# cachestat -T 10
TIME         HITS   MISSES  DIRTIES HITRATIO   CACHED_MB
08:58      771        0        1  100.00%         190
09:08    33036    53975       16   37.97%         400
09:18       15    68544        2    0.02%         668
09:38        5    67424        0    0.01%        1187
10:08   268421       11       12  100.00%        1232

First read: misses pile up, CACHED_MB climbs by exactly the gigabyte as the file pours into cache, hit ratio near zero. Second read at 21:10:08: a quarter-million hits, nothing from disk, and the application just experienced the same logical workload at a few hundred times the speed. Same code, same file, same disk. The only variable is cache state.

That sensitivity is the operational point. A hit ratio that steps down and stays down means the working set outgrew the cache, or something else is competing for the memory: a deploy that grew the heap, another workload colocated onto the box. The symptom files as “disk got slow”; the cause lives in the memory post.

cachestat hooks kernel internals that carry no stability promise. Its author describes it as a sand castle a kernel update can wash away. Tools built on kprobes (post two) age like this; the question they answer doesn’t.

A Working Method

Assembled into the order I actually use: characterize the workload first, with filetop and opensnoop showing who’s touching what. That catches the “surprise workload” class of problem (the backup, the log-spammer, the accidental full-table scan) before any latency analysis. Then measure operation latency with ext4slower and ext4dist against the complaint. Got lines? The file system layer is implicated. Got silence? Look elsewhere. Then check the cache with cachestat, because a hit-ratio cliff explains slow operations without any disk involvement. Only after all that, descend to the disks, which is exactly where the next post picks up.

Why This Matters

The file system is the great launderer of I/O: it absorbs, defers, multiplies, and reorders everything passing through, which makes both layers of metrics liars in isolation. Disk metrics can’t name the victim, and application metrics can’t see the cause. Per-operation latency tracing at the file system boundary is the honest middle, and it turns “storage feels slow” from an opinion into a line of output with a process name, a file name, and a millisecond count on it.

Next post: down the stack at last. The block layer, what iostat’s columns really mean, disk latency distributions, and when the disk actually did do it.

References

Gregg, B. (2020). Systems Performance: Enterprise and the Cloud, 2nd Edition. Addison-Wesley. Chapter 8. ↩
Gregg, B. (2019). BPF Performance Tools. Addison-Wesley. Chapter 8. ↩