When the Disk Did Do It: Block I/O Latency Without the Averages
Last post argued for measuring file system latency before blaming disks.
Sometimes you do that and the disks stand implicated anyway: ext4slower
prints lines, the cache hit ratio is fine, and the slow operations all
bottom out in device I/O. Now the descent to the block layer is justified,
and you arrive with a question (“why did these reads take 17 ms?”) instead
of a suspicion. This post is about answering it: what the block layer does
to your I/O on the way down, what iostat’s columns honestly mean, and why
per-I/O tracing beats every average on this layer in particular.
Part six of the series, and the layer where the latency numbers get big enough to matter on a human scale. Recall from post one that a rotational disk I/O is months to the CPU’s seconds.
What the Block Layer Does to Your I/O
Between the file system and the device driver, the kernel runs a
sophisticated middleman. It merges adjacent requests, so sequential reads
arriving separately leave as one larger I/O. It queues and schedules:
since Linux 5.0 the multi-queue (blk-mq) design is the only one, born
because a single shared request queue with a single lock became its own
bottleneck once devices got fast enough to embarrass it. Per-CPU submission
queues feed per-device dispatch queues, and a pluggable scheduler decides
ordering: none for fast NVMe that needs no help, mq-deadline to cap
starvation, bfq for fairness on desktops, kyber for latency targets.
Two consequences for analysis. First, the I/O count and sizes at this layer won’t match the file system’s. Merging deflates counts; metadata and RAID inflate them (a small RAID-5 write can trigger a read-modify-write cycle: read the stripe, recompute parity, write it back, and your 4 KB write became multiple I/O plus math). Second, latency at this layer has two components, time queued in the OS and time on the device. Conflating them points fixes in the wrong direction, and most tools conflate them.
Reading iostat Honestly
iostat -sxz 1 is line six of the 60-second checklist, and three of its
columns do the heavy lifting: await (average time per I/O, queue plus
service), aqu-sz (average queue length), and %util (fraction of time
the device was doing anything).
Each needs an asterisk. await is an average, and this post’s whole
argument is that block I/O averages hide the story. aqu-sz above zero
sustained means queueing, which is your saturation signal. And %util,
the most-quoted number, is the least trustworthy:
It’s an interval summary, so bursty I/O (write-back flushes from the file
systems post arrive precisely in bursts) can hide a saturated second
inside a calm-looking average. Worse, on anything virtual the number can
be outright misleading. A “disk” that’s really a RAID volume or cloud
block storage reports 100% busy when some underlying device was always
busy, while other spindles sit idle and the volume could absorb plenty
more. The reverse error happens too: a controller with a write-back cache
acks writes instantly, showing a nearly idle device while the physical
disks behind it churn for seconds afterward. On modern infrastructure,
treat %util as a hint and queue depth plus latency as the facts.
Distributions, Not Averages
The case for histograms is stronger at the block layer than anywhere else
in this series, because device latency is structurally multimodal: an
on-device cache hit and a seek are different physical events with
different costs, and an SSD quietly doing internal garbage collection
produces a third population. An average melts all of them into a number
that describes none of them. biolatency instruments block I/O completion
and shows the shape. Here, ten seconds on a production Hadoop instance:
# biolatency 10 1
Tracing block device I/O... Hit Ctrl-C to end.
usecs : count distribution
64 -> 127 : 15 | |
128 -> 255 : 4475 |************ |
256 -> 511 : 14222 |****************************************|
512 -> 1023 : 12303 |********************************** |
1024 -> 2047 : 5649 |*************** |
2048 -> 4095 : 995 |** |
4096 -> 8191 : 1980 |***** |
8192 -> 16383 : 3681 |********** |
16384 -> 32767 : 1895 |***** |
Two populations in plain sight: the bulk completing in 128–2048 µs, and a second hump out at 8–32 milliseconds. The mean of this distribution is about a millisecond and a half, a number that describes neither the fast mode nor the slow one, and would have told you the disk was fine. The second hump is thousands of I/O per ten seconds taking 10–100× longer than the rest; if a latency-sensitive application sits upstream, that hump is the complaint.
Add -Q and the histogram includes OS queue time, not just device time.
Comparing the two runs splits the diagnosis cleanly. Device time high: the
hardware (or the virtual hardware) is slow. Queued time high: the device
is fine and you’re drowning it, which is a workload or scheduler problem,
not a procurement problem.
Naming Names: biosnoop
The histogram says what shape; biosnoop says who, one line per I/O:
# biosnoop
TIME(s) COMM PID DISK T SECTOR BYTES LAT(ms)
0.000000 java 5136 xvdq R 980043184 45056 0.35
0.143724 java 5136 xvdy R 5153784 45056 1.08
0.189267 java 5136 xvdy R 979232832 45056 14.00
0.190330 java 5136 xvdy R 979232920 45056 15.05
0.190376 java 5136 xvdy R 979233008 45056 15.09
0.190441 java 5136 xvdy R 979233272 36864 15.15
Look at the pattern in those slow lines, because it’s a classic: a run of
reads on xvdy at consecutive sectors, all completing around 15 ms,
issued microseconds apart. That’s not fifteen independent slow I/O. It’s
a queue draining. The first I/O in the batch ate the real device latency
and everything behind it inherited the wait. biosnoop -Q makes the split
explicit with separate queue-time and device-time columns; in one of the
book’s examples, a cksum process shows 17–35 ms of queue time against
1–8 ms of device time, a device performing fine behind a backlog.
This is also the tool for the time-honored colocation whodunit: sort the output by latency, look at what was happening when the outliers fired, and more than once the answer is another tenant’s backup job, a RAID rebuild generating I/O no application issued, or the flusher threads dumping a minute of write-back in one burst. Every one of them is an “unrelated I/O” case from last post’s taxonomy, finally caught holding the queue.
Why This Matters
The block layer is where averages go to lie: multimodal device behavior,
queueing that inherits one slow I/O’s latency across a whole batch, and
utilization numbers that virtualization has reduced to vibes. The working
method is short. Confirm the complaint is device-side (biolatency,
compare with and without -Q), identify the population (which mode, queue
or device), then name the I/O (biosnoop). It ends investigations that
%util dashboards can only prolong.
Next post, the last layer: the network, where latency stops being a local matter, TCP gets opinions about your traffic, and retransmit timers from post one’s table (centuries, at human scale) come for your tail latencies.
References
- Gregg, B. (2020). Systems Performance: Enterprise and the Cloud, 2nd Edition. Addison-Wesley. Chapter 9.
- Gregg, B. (2019). BPF Performance Tools. Addison-Wesley. Chapter 9.