Maybe It Was the Network After All: TCP Latency With Evidence

The first post in this series made fun of “maybe it’s the network, can you check with the network team?” as the canonical anti-method: a hypothesis with no data attached. The joke has a second half, though. Sometimes it is the network, and the difference between the anti-method and a finding is whether you show up with retransmit traces or with vibes. This post is about generating the traces.¹²

The network deserves its reputation for slipperiness. It’s the one resource in this series that isn’t yours: the interesting failures happen in switches you can’t log into, on paths you share with strangers, with TCP doing adaptive things on your behalf that change second to second. What you can observe, completely, are the endpoints, and the endpoints know almost everything worth knowing.

Which Latency Do You Mean?

“Network latency” is four different numbers, and conflating them wastes investigation time. Name resolution comes first and is forgotten first: a slow or flaky DNS server taxes every connection made without a cached answer, and files its symptoms under everybody else’s name. Ping latency is the raw ICMP round trip, useful as a floor. Connection latency is the TCP handshake, from the client’s SYN out to the SYN-ACK back, which exercises the remote kernel’s handshake handling and listen queues, not just the wire. And first-byte latency is connection plus the time the far end took to wake a thread and produce a response, which means it measures the remote host’s scheduler and run queues (post three) as much as any network.

The decomposition matters because each number points somewhere different. Lookup slow: the resolver. Ping slow: the path. Connect slow but ping fine: the remote host’s listen backlog or a dropped SYN. First-byte slow but connect fine: the remote application. A handful of commands in, you already know which team to talk to, with numbers.
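Here is a minimal sketch of that decomposition from the client side, using only the Python standard library. The function name measure and the default HTTP-style request are my own choices for illustration, and first_byte_ms is timed from the end of the handshake so the three numbers don't overlap.

```python
import socket
import time


def measure(host, port, request=b"HEAD / HTTP/1.0\r\n\r\n", timeout=5.0):
    """Return DNS, connect, and first-byte latency in milliseconds."""
    t0 = time.perf_counter()
    # Name resolution: uncached lookups pay this before anything else.
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM)[0]
    t1 = time.perf_counter()

    with socket.socket(family, socktype, proto) as s:
        s.settimeout(timeout)
        # connect() returns once the handshake completes (SYN out, SYN-ACK back).
        s.connect(addr)
        t2 = time.perf_counter()
        # First byte: send a request, then wait for the remote application to answer.
        s.sendall(request)
        s.recv(1)
        t3 = time.perf_counter()

    return {
        "dns_ms": (t1 - t0) * 1000,
        "connect_ms": (t2 - t1) * 1000,
        "first_byte_ms": (t3 - t2) * 1000,
    }
```

Calling measure("example.com", 80) a few times and comparing the three fields against a plain ping to the same host gives the four numbers side by side. A single sample is an anecdote, so collect enough to see a distribution.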

Buffers All the Way Down

Latency on the network path mostly hides in queues, same as everywhere else in this series. The only novelty is how many there are. Outbound, your data waits in the socket send buffer, then the qdisc (the kernel’s traffic-control queue), then the driver ring, then in every switch and router buffer along the path. Each one trades latency for throughput.

The pathological version has a name: bufferbloat.³ Network gear with oversized buffers queues packets for long intervals instead of dropping them. TCP relies on timely loss signals for congestion control, so the delayed feedback makes it misjudge the path, and latency climbs for everyone. The fixes that landed in Linux (the CoDel queueing discipline, byte queue limits, TCP small queues) all amount to the same principle: keep queues short and honest, drop early, let the endpoints adapt.
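The arithmetic behind bufferbloat is short enough to sketch. The buffer size and link rates below are illustrative numbers I picked, not measurements from any particular device: the point is only that a full buffer adds its drain time to every packet queued behind it.

```python
def queue_delay_ms(queued_bytes, link_bits_per_sec):
    """Time for a full buffer to drain onto the link, in milliseconds."""
    return queued_bytes * 8 / link_bits_per_sec * 1000


# Illustrative numbers only: a 1 MiB buffer in front of links of different speeds.
for name, rate in [("10 Mbit/s", 10e6), ("1 Gbit/s", 1e9)]:
    print(f"{name}: {queue_delay_ms(1024 * 1024, rate):.1f} ms")
```

The same megabyte that costs under ten milliseconds on a gigabit link costs most of a second on a ten-megabit uplink, which is why buffers sized for throughput on the fast side of a path turn into latency on the slow side.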

One queue deserves special respect on servers: the connection backlog, where completed handshakes wait for the application to accept(). When the application can’t keep up, the backlog fills and the kernel drops SYNs, which the client retransmits a second or more later. Backlog drops are an unambiguous host-overload signal, and they manufacture multi-second connection latencies out of a host that’s merely busy. The drop counters live in nstat (TcpExtListenOverflows and TcpExtListenDrops), and tcpaccept-style tracing puts process names on the accepting side.
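If nstat isn't installed, the same counters can be read straight from /proc on Linux. This is a sketch that assumes the usual /proc/net/netstat layout (a line of names followed by a line of values, both with the same prefix) and the ListenOverflows and ListenDrops counter names; the helper names are mine.

```python
def parse_netstat(text):
    """Parse /proc/net/netstat-style text into {"TcpExt": {"ListenDrops": n, ...}}."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    stats = {}
    # The file comes in pairs: a header line of names, then a line of values,
    # both starting with the same "Prefix:" token.
    for names, values in zip(lines[0::2], lines[1::2]):
        prefix = names[0].rstrip(":")
        stats[prefix] = dict(zip(names[1:], map(int, values[1:])))
    return stats


def listen_drops(path="/proc/net/netstat"):
    """Return the two backlog counters, or None for any that are missing."""
    with open(path) as f:
        tcpext = parse_netstat(f.read()).get("TcpExt", {})
    return {k: tcpext.get(k) for k in ("ListenOverflows", "ListenDrops")}
```

The values are cumulative since boot, so the useful number is the difference between two calls to listen_drops() a few seconds apart. A delta that keeps growing under load is the backlog overflowing right now.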

Counters First, As Always

Before tracing, the sixty-second-checklist layer, because the kernel’s TCP counters are free and two of them carry most of the signal:

$ sar -n TCP,ETCP 1
    active/s passive/s    iseg/s    oseg/s
        1.00     12.00   4623.00   5126.00
  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
      0.00      0.00      1.00      0.00      2.00

active/s and passive/s are outbound and inbound new connections, your connection workload at a glance, and a sudden jump in either is workload characterization handed to you for free. retrans/s against oseg/s gives the retransmit rate: a fraction of a percent is life on a shared network; percents are a problem. ss -ti adds per-socket truth (RTT estimates, congestion window, retransmit counts per connection) when you need to zoom from “the host retransmits” to “this flow suffers.”
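The retransmit-rate arithmetic, spelled out for the sample above. The function name is mine, and the zero-traffic guard is just a convenience for idle intervals.

```python
def retransmit_pct(retrans_per_s, oseg_per_s):
    """Retransmitted segments as a percentage of segments sent."""
    if oseg_per_s == 0:
        return 0.0
    return 100.0 * retrans_per_s / oseg_per_s


# Values from the sar sample above: retrans/s = 1.00, oseg/s = 5126.00.
print(f"{retransmit_pct(1.00, 5126.00):.3f}%")
```

That works out to roughly 0.02%, comfortably in "life on a shared network" territory. One interval proves little either way; the number worth watching is how this moves when the latency complaints start.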

Who Talks to Whom: tcplife

For the tracing layer, tcplife logs one line per TCP session at the moment it closes, with endpoints, bytes each way, and lifespan:

# tcplife
PID   COMM  LADDR          LPORT RADDR          RPORT TX_KB RX_KB    MS
4169  java  100.1.111.231  32648 100.2.0.48      6001     0     0  3.99
4169  java  100.1.111.231  32650 100.2.0.48      6001     0     0  4.10
4169  java  100.1.111.231  40158 100.2.116.192   6001     7    33  3590.91
4169  java  100.1.111.231  56940 100.5.177.31    6101     0     0  2.48

(The tool exists because Julia Evans tweeted a wish for it; Gregg built it the same year. The good flashlights get built this way.)

Read the shape of that output, not just the lines. Most sessions here moved zero kilobytes and lived a few milliseconds: connect, nothing, close. That’s empty connection churn: health checks, or a client opening a fresh connection per request because nobody configured connection pooling, each request paying the full handshake tax (plus TLS, in real deployments) to move no data. One session lived 3.6 seconds and actually transferred bytes; that’s the real workload, swimming in overhead. This single tool answers the network half of workload characterization (who connects to whom, how often, for how long, moving how much), and it does so with in-kernel efficiency, instrumenting TCP state changes rather than every packet. Sniffing pays per-packet; tcplife pays per-session.
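Reading the shape by eye works for four lines; for four thousand, a small script helps. This sketch assumes the nine-column layout shown above with no spaces in COMM, and it treats a session showing 0 in both TX_KB and RX_KB as "empty", which I'm assuming means under a kilobyte each way rather than literally nothing.

```python
from collections import defaultdict

SAMPLE = """\
4169  java  100.1.111.231  32648 100.2.0.48      6001     0     0  3.99
4169  java  100.1.111.231  32650 100.2.0.48      6001     0     0  4.10
4169  java  100.1.111.231  40158 100.2.116.192   6001     7    33  3590.91
4169  java  100.1.111.231  56940 100.5.177.31    6101     0     0  2.48
"""


def summarize(text):
    """Group sessions by remote endpoint; count those that moved no data."""
    summary = defaultdict(lambda: {"sessions": 0, "empty": 0, "total_ms": 0.0})
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 9 or not fields[0].isdigit():
            continue  # skip the header and anything unexpected
        _pid, _comm, _laddr, _lport, raddr, rport, tx_kb, rx_kb, ms = fields
        entry = summary[(raddr, int(rport))]
        entry["sessions"] += 1
        entry["total_ms"] += float(ms)
        # 0 KB in both directions: assumed to mean under 1 KB each way.
        if int(tx_kb) == 0 and int(rx_kb) == 0:
            entry["empty"] += 1
    return dict(summary)


for (raddr, rport), e in summarize(SAMPLE).items():
    print(f"{raddr}:{rport} sessions={e['sessions']} empty={e['empty']}")
```

A remote endpoint where nearly every session is empty and short is a candidate for connection pooling or a chattier-than-necessary health check.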

Catching the Network Red-Handed: tcpretrans

Retransmits are the closest thing endpoints have to direct evidence of network trouble, and tcpretrans prints each one as it happens, no packet capture required:

# tcpretrans
Tracing retransmits ... Hit Ctrl-C to end
TIME     PID    IP LADDR:LPORT        T> RADDR:RPORT        STATE
00:20:11 72475  4  100.1.58.46:35908  R> 100.2.0.167:50010  ESTABLISHED
00:20:12 60695  4  100.1.58.46:52346  R> 100.2.6.189:50010  ESTABLISHED
00:20:13 60695  6  ::ffff:100.1.58.46:13562 R> ::ffff:100.2.51.209:47356 FIN_WAIT1

The STATE column is the diagnostic part. Retransmits on ESTABLISHED sessions point at the path: loss or congestion on connections that were working. A pile-up in SYN_SENT points at the far host: SYNs going unanswered, which smells like the backlog drops from earlier, on someone else’s server. Same tool, two different accusations, each with an address and port attached.
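The same reading can be automated once the output gets long. This sketch assumes the seven-column layout shown above; the HINTS table only restates the two interpretations from the text and is not a complete map of TCP states.

```python
from collections import Counter

# Rough reading of each state, following the text; not a complete TCP state map.
HINTS = {
    "ESTABLISHED": "path: loss or congestion on a working connection",
    "SYN_SENT": "far host: SYNs going unanswered",
}

SAMPLE = """\
00:20:11 72475  4  100.1.58.46:35908  R> 100.2.0.167:50010  ESTABLISHED
00:20:12 60695  4  100.1.58.46:52346  R> 100.2.6.189:50010  ESTABLISHED
00:20:13 60695  6  ::ffff:100.1.58.46:13562 R> ::ffff:100.2.51.209:47356 FIN_WAIT1
"""


def tally(text):
    """Count retransmits per (state, remote address, remote port)."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7 or not fields[1].isdigit():
            continue  # skip banner and header lines
        remote, state = fields[5], fields[6]
        # rsplit keeps IPv6 addresses intact: only the last colon separates the port.
        raddr, rport = remote.rsplit(":", 1)
        counts[(state, raddr, int(rport))] += 1
    return counts


for (state, raddr, rport), n in tally(SAMPLE).most_common():
    print(n, state, f"{raddr}:{rport}", "->", HINTS.get(state, "no specific hint"))
```

Sorting by count puts the worst remote endpoint at the top, which is usually the address worth carrying to the other team.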

And recall the scale. On post one’s human-scaled latency table, a TCP timer-based retransmit was one to three centuries. Each line above is a connection that stalled for an RTO, likely one to three real seconds, while the application above it saw nothing but a hung request. A fraction-of-a-percent retransmit rate, invisible in averages, is exactly the kind of thing that owns your p99. If your tail latency has a mysterious one-second mode, this is the first tool I’d run.
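A back-of-envelope sketch of why a tiny per-segment rate reaches the tail. The workload numbers are hypothetical, and the model makes two loud assumptions: losses are independent, and every retransmit is timer-based and costs a full RTO. Real stacks recover many losses faster through fast retransmit, so treat this as a worst-case bound, not a prediction.

```python
def stall_probability(per_segment_retransmit, segments_per_request):
    """Chance that at least one segment in a request is retransmitted."""
    return 1 - (1 - per_segment_retransmit) ** segments_per_request


# Hypothetical workload: 0.1% of segments retransmitted, 20 segments per request.
p = stall_probability(0.001, 20)
print(f"{p:.2%} of requests hit at least one retransmit")
```

Under those assumptions, a 0.1% segment rate touches about 2% of requests. That is more than one in a hundred, so the 99th-percentile request is one that waited on a retransmit.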

Why This Matters

Four latencies instead of one, queues at every hop, and two tracing tools that replace “maybe it’s the network” with an address, a port, a state, and a timestamp. That’s the difference between routing a ticket and making a finding. When the evidence says ESTABLISHED retransmits to one subnet, the network team gets a question they can actually answer.

That closes the tour: method, instrumentation, then CPUs, memory, file systems, disks, and the wire. The same three questions at every layer (errors? saturation? utilization?), the same preference for distributions over averages, latency as the common currency throughout. The two Gregg books below are each eight hundred pages deeper than this series at every single layer; if any post here earned a shrug of “surely it’s more complicated than that,” then yes, and that’s where it’s written down. Go search under better streetlights.

References

1. Gregg, B. (2020). Systems Performance: Enterprise and the Cloud, 2nd Edition. Addison-Wesley. Chapter 10.
2. Gregg, B. (2019). BPF Performance Tools. Addison-Wesley. Chapter 10.
3. Nichols, K., Jacobson, V. (2012). Controlling Queue Delay. ACM Queue 10(5). The CoDel paper.