When the Disk Did Do It: Block I/O Latency Without the Averages

Last post argued for measuring file system latency before blaming disks. Sometimes you do that and the disks stand implicated anyway: ext4slower prints lines, the cache hit ratio is fine, and the slow operations all bottom out in device I/O. Now the descent to the block layer is justified, and you arrive with a question (“why did these reads take 17 ms?”) instead of a suspicion. This post is about answering it: what the block layer does to your I/O on the way down, what iostat’s columns honestly mean, and why per-I/O tracing beats every average on this layer in particular.¹²

Part six of the series, and the layer where the latency numbers get big enough to matter on a human scale. Recall from post one that a rotational disk I/O is months to the CPU’s seconds.

What the Block Layer Does to Your I/O

Between the file system and the device driver, the kernel runs a sophisticated middleman. It merges adjacent requests, so sequential reads arriving separately leave as one larger I/O. It queues and schedules: since Linux 5.0 the multi-queue (blk-mq) design is the only one, born because a single shared request queue with a single lock became its own bottleneck once devices got fast enough to embarrass it. Per-CPU submission queues feed per-device dispatch queues, and a pluggable scheduler decides ordering: none for fast NVMe that needs no help, mq-deadline to cap starvation, bfq for fairness on desktops, kyber for latency targets.

Two consequences for analysis. First, the I/O count and sizes at this layer won’t match the file system’s. Merging deflates counts; metadata and RAID inflate them (a small RAID-5 write can trigger a read-modify-write cycle: read the stripe, recompute parity, write it back, and your 4 KB write became multiple I/O plus math). Second, latency at this layer has two components, time queued in the OS and time on the device. Conflating them points fixes in the wrong direction, and most tools conflate them.

Reading iostat Honestly

iostat -sxz 1 is line six of the 60-second checklist, and three of its columns do the heavy lifting: await (average time per I/O, queue plus service), aqu-sz (average queue length), and %util (fraction of time the device was doing anything).

Each needs an asterisk. await is an average, and this post’s whole argument is that block I/O averages hide the story. aqu-sz above zero sustained means queueing, which is your saturation signal. And %util, the most-quoted number, is the least trustworthy:

It’s an interval summary, so bursty I/O (write-back flushes from the file systems post arrive precisely in bursts) can hide a saturated second inside a calm-looking average. Worse, on anything virtual the number can be outright misleading. A “disk” that’s really a RAID volume or cloud block storage reports 100% busy when some underlying device was always busy, while other spindles sit idle and the volume could absorb plenty more. The reverse error happens too: a controller with a write-back cache acks writes instantly, showing a nearly idle device while the physical disks behind it churn for seconds afterward. On modern infrastructure, treat %util as a hint and queue depth plus latency as the facts.

Distributions, Not Averages

The case for histograms is stronger at the block layer than anywhere else in this series, because device latency is structurally multimodal: an on-device cache hit and a seek are different physical events with different costs, and an SSD quietly doing internal garbage collection produces a third population. An average melts all of them into a number that describes none of them. biolatency instruments block I/O completion and shows the shape. Here, ten seconds on a production Hadoop instance:

# biolatency 10 1
Tracing block device I/O... Hit Ctrl-C to end.

     usecs               : count      distribution
        64 -> 127        : 15       |                                        |
       128 -> 255        : 4475     |************                            |
       256 -> 511        : 14222    |****************************************|
       512 -> 1023       : 12303    |**********************************      |
      1024 -> 2047       : 5649     |***************                         |
      2048 -> 4095       : 995      |**                                      |
      4096 -> 8191       : 1980     |*****                                   |
      8192 -> 16383      : 3681     |**********                              |
     16384 -> 32767      : 1895     |*****                                   |

Two populations in plain sight: the bulk completing in 128–2048 µs, and a second hump out at 8–32 milliseconds. The mean of this distribution is about a millisecond and a half, a number that describes neither the fast mode nor the slow one, and would have told you the disk was fine. The second hump is thousands of I/O per ten seconds taking 10–100× longer than the rest; if a latency-sensitive application sits upstream, that hump is the complaint.

Add -Q and the histogram includes OS queue time, not just device time. Comparing the two runs splits the diagnosis cleanly. Device time high: the hardware (or the virtual hardware) is slow. Queued time high: the device is fine and you’re drowning it, which is a workload or scheduler problem, not a procurement problem.

Naming Names: biosnoop

The histogram says what shape; biosnoop says who, one line per I/O:

# biosnoop
TIME(s)     COMM     PID    DISK    T SECTOR     BYTES    LAT(ms)
000000    java     5136   xvdq    R 980043184  45056      0.35
143724    java     5136   xvdy    R 5153784    45056      1.08
189267    java     5136   xvdy    R 979232832  45056     14.00
190330    java     5136   xvdy    R 979232920  45056     15.05
190376    java     5136   xvdy    R 979233008  45056     15.09
190441    java     5136   xvdy    R 979233272  36864     15.15

Look at the pattern in those slow lines, because it’s a classic: a run of reads on xvdy at consecutive sectors, all completing around 15 ms, issued microseconds apart. That’s not fifteen independent slow I/O. It’s a queue draining. The first I/O in the batch ate the real device latency and everything behind it inherited the wait. biosnoop -Q makes the split explicit with separate queue-time and device-time columns; in one of the book’s examples, a cksum process shows 17–35 ms of queue time against 1–8 ms of device time, a device performing fine behind a backlog.

This is also the tool for the time-honored colocation whodunit: sort the output by latency, look at what was happening when the outliers fired, and more than once the answer is another tenant’s backup job, a RAID rebuild generating I/O no application issued, or the flusher threads dumping a minute of write-back in one burst. Every one of them is an “unrelated I/O” case from last post’s taxonomy, finally caught holding the queue.

Why This Matters

The block layer is where averages go to lie: multimodal device behavior, queueing that inherits one slow I/O’s latency across a whole batch, and utilization numbers that virtualization has reduced to vibes. The working method is short. Confirm the complaint is device-side (biolatency, compare with and without -Q), identify the population (which mode, queue or device), then name the I/O (biosnoop). It ends investigations that %util dashboards can only prolong.

Next post, the last layer: the network, where latency stops being a local matter, TCP gets opinions about your traffic, and retransmit timers from post one’s table (centuries, at human scale) come for your tail latencies.

References

Gregg, B. (2020). Systems Performance: Enterprise and the Cloud, 2nd Edition. Addison-Wesley. Chapter 9. ↩
Gregg, B. (2019). BPF Performance Tools. Addison-Wesley. Chapter 9. ↩