In an effort to rid the world of needless application and network performance slowdowns, we turn to retransmission timeouts (RTOs). What are they and what can you do about them?
What Is TCP Retransmission?
TCP (the Transmission Control Protocol) provides reliable, ordered delivery of data between devices on an IP network. When an outbound segment is handed down to IP and there's no acknowledgment for the data before TCP's retransmission timer expires, the segment is retransmitted. This happens all the time, and typically doesn't cause much of a problem: the unacknowledged packets are resent, and the network continues to hum along.
So What Are Retransmission Timeouts?
A retransmission timeout (RTO), on the other hand, is quite a different beast. An RTO occurs when the sender has gone without the acknowledgments it expects for so long that it takes a time out and stops sending altogether. After the timeout, usually at least one second (RFC 6298 recommends a one-second minimum, though some TCP stacks use a shorter floor), the sender cautiously starts sending again, testing the waters with just one packet at first, then two packets, and so on.
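To make the timer behind an RTO more concrete, here is a minimal sketch of how a sender might compute it, following the smoothing rules in RFC 6298. The class name `RtoEstimator` and its parameters are illustrative assumptions, not part of any real TCP stack's API, and production implementations differ in their floors, ceilings, and clock granularity.

```python
class RtoEstimator:
    """Illustrative RFC 6298-style retransmission timer (hypothetical helper)."""

    ALPHA = 1 / 8  # weight of a new sample in the smoothed RTT
    BETA = 1 / 4   # weight of a new sample in the RTT variance
    K = 4          # variance multiplier

    def __init__(self, min_rto=1.0, max_rto=60.0, clock_granularity=0.001):
        self.min_rto = min_rto        # RFC 6298 recommends a 1 s floor
        self.max_rto = max_rto        # assumed ceiling for this sketch
        self.granularity = clock_granularity
        self.srtt = None              # smoothed round-trip time, seconds
        self.rttvar = None            # round-trip time variance, seconds
        self.rto = 1.0                # initial RTO before any RTT sample

    def _clamp(self, value):
        return max(self.min_rto, min(self.max_rto, value))

    def on_rtt_sample(self, rtt):
        """Update the timer from one measured round-trip time (seconds)."""
        if self.srtt is None:
            # First measurement seeds both estimators.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # The variance is updated using the previous smoothed RTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = self._clamp(self.srtt + max(self.granularity, self.K * self.rttvar))
        return self.rto

    def on_timeout(self):
        """Back off: each consecutive timeout doubles the timer."""
        self.rto = self._clamp(self.rto * 2)
        return self.rto


estimator = RtoEstimator()
print(estimator.on_rtt_sample(0.05))  # 1.0 (a 50 ms RTT is clamped to the 1 s floor)
print(estimator.on_timeout())         # 2.0
print(estimator.on_timeout())         # 4.0
```

Even on a fast link the floor dominates, which is why a single RTO costs far more than an ordinary retransmission, and repeated timeouts on the same connection cost more each time.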
As a result, each RTO typically causes at least a one-second delay for the affected connection. We've seen sites that show millions of RTOs in a 24-hour window, with one million RTOs translating to more than 277 hours of cumulative application delay. These retransmission timeouts add up to significant problems for network and application performance and certainly require some tuning and optimization.
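The arithmetic behind that figure is easy to check. The sketch below assumes every RTO costs exactly the one-second minimum, so the result is a lower bound summed across all connections; the function name is ours, chosen for illustration.

```python
SECONDS_PER_HOUR = 3600


def rto_delay_hours(rto_count, seconds_per_rto=1.0):
    """Cumulative delay in hours, assuming a fixed cost per RTO."""
    return rto_count * seconds_per_rto / SECONDS_PER_HOUR


print(f"{rto_delay_hours(1_000_000):.1f} hours")  # 277.8 hours
```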
How Can I Eliminate RTOs?
One way to spot RTOs is to simulate the TCP state machine at each endpoint, and then infer when problems occur in order to detect issues like bad congestion avoidance, Nagle delays, PAWS drops, and excessive tinygrams.
Another method is to access your environment's wire data, or all the communications on the network itself. By tapping into your wire data, you can track RTO metrics and correlate them with traffic spikes in order to quickly figure out what's causing the timeouts. If you can immediately identify where you're losing packets and locate the congested links, pinpointing and fixing the root cause becomes a lot easier.
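As a rough sketch of that kind of correlation, suppose you have already extracted per-minute RTO counts and traffic volumes from your monitoring data. The sample numbers, the function names, and the "twice the median" spike threshold below are all assumptions made for illustration; real wire-data tools expose these metrics in their own ways.

```python
from math import sqrt
from statistics import median


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series has no meaningful correlation
    return cov / sqrt(var_x * var_y)


def spike_intervals(values, factor=2.0):
    """Indices of intervals whose value exceeds factor times the median."""
    threshold = factor * median(values)
    return [i for i, v in enumerate(values) if v > threshold]


# Hypothetical per-minute samples, invented for this example.
traffic_mbps = [100, 110, 105, 480, 120, 500, 115]
rto_counts = [2, 3, 2, 40, 3, 55, 2]

print(f"correlation: {pearson(traffic_mbps, rto_counts):.2f}")
print("traffic spikes at minutes:", spike_intervals(traffic_mbps))
print("RTO spikes at minutes:", spike_intervals(rto_counts))
```

When the RTO spikes line up with the traffic spikes, congestion on a shared link is a likely suspect; when they don't, the physical-layer causes are worth checking first.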
We actually wrote up a TCP optimization guide to compile some tips and tricks for just this kind of problem, so grab a copy of Optimizing TCP: Nagle's Algorithm and Beyond to start taking advantage of what we've learned over years of delivering real-time network analytics.
What Are Common Causes of RTOs?
If you can't see into your network traffic, we recommend you start by taking a look at these common causes of retransmission timeouts:
- Duplex mismatch on the switch
- A bad cable
- Bad checksums
- Driver issues
While you're thinking about improving your network performance, here are a few of our favorite tips and hacks: