TCP/IP Troubleshooting. How to detect issues at the transport layer with Wireshark.
TCP/IP is both a protocol suite (a set of protocols used on the Internet and on other networks alike) and a pair of specific protocols: TCP and IP.
As a protocol suite, TCP/IP contains many protocols such as HTTP, TELNET, FTP, ARP, DNS, etc., but it takes its name from the two most famous protocols within the suite: TCP as a transport protocol, and IP as a network protocol.
It simply happened that TCP and IP were the rock stars of the whole protocol stack, so the whole set of protocols was named after them, combining both in a single name: TCP/IP.
To further clarify, let's make a musical analogy: if you are in a band whose lead singer is “Jon Bon Jovi”, then even if there are 13 more musicians in the band, you cannot be surprised if the whole band ends up being called “Bon Jovi”, can you? That is actually what happened to TCP/IP as well. TCP and IP, as individual protocols, were the lead singers of the suite, so the whole band was named after them (TCP/IP), even though many other (excellent) protocols were playing in the protocol suite.
In summary:
TCP/IP refers to a suite of protocols used on the Internet that includes a transport protocol named TCP and a network protocol named IP, but also includes many other protocols. The whole stack of protocols was named TCP/IP just because TCP and IP were the most famous protocols of the group. However, when we say TCP/IP, we are referring to the suite of protocols that powers the Internet, rather than to the individual TCP and IP protocols.
TCP/IP is sometimes called the Internet Protocol Suite (IPS), which for me is certainly a better name, but for some reason it is less used. What can we do? After all, the world is not a perfect place, is it?
So far so good. Once the differences between TCP, IP and TCP/IP have been clarified, we will narrow down the scope of this article.
In this article we are focusing on TCP as an independent transport protocol.
Because the TCP/IP protocol suite uses a layered model where each layer serves the layer above it (see picture below), one good way of detecting issues in a TCP/IP network is to focus on the transport layer (the TCP protocol), and that is exactly what we are going to do here.
At each layer the PDU (Protocol Data Unit) has a different name; at the transport layer the PDU is called a “segment” (see below):
LAYER #   OSI NAME       COMMON PROTOCOL OR USE   PDU NAME
-------   ------------   ----------------------   --------------------------
Layer 1   Physical       Transceiver              bits, or a physical signal
Layer 2   Datalink       Ethernet                 frame
Layer 3   Network        IP                       packet
Layer 4   Transport      TCP                      segment
Layer 5   Session        SIP                      data, request, or response
Layer 6   Presentation   Encryption/compression   data, request, or response
Layer 7   Application    HTTP                     data, request, or response
The reason for focusing on the TCP protocol (analysis of segments) is simple. If there is a problem in the transport layer, you do not need to investigate the application layer any further, something that most of the time requires knowing the detailed specification of the protocol you are troubleshooting, such as SMB, NFS, HTTP or others.
It is therefore simpler to start looking for issues at the transport layer. If nothing is wrong at the transport layer, we will need to move our troubleshooting up to the application layer. On the other hand, if something is wrong at the transport layer, we do not need to troubleshoot the upper layers, as we can stick to the simpler transport layer and fix the underlying issue there.
But… What are the basic concepts to know for troubleshooting TCP effectively?
First of all we need to know what a TCP segment looks like. Thanks to Wikipedia, we have this picture of the header of a TCP segment:
The goal of the TCP protocol is to provide connection-oriented communication that ensures reliable, ordered and error-checked delivery of a stream of bytes. Each field of the TCP header therefore has a mission related to that goal.
Let's explain the basic fields:
Source port and destination port: the source port and destination port identify the service that the data must be sent to (destination port) and where the data is being sent from (source port). Because a single host can offer various services such as HTTP, FTP, Telnet, etc., every connecting client must use a destination port number to indicate which service it is addressing. The services that conventionally listen on each port number are registered with IANA.
The sequence number is important for reliable, in-order delivery. Sequence numbers are implemented as 32-bit numbers. A TCP communication can be seen as two communication streams, one from source to destination, and the other from destination to source. Source and destination each maintain their own sequence number for their side of the communication. Both sides use sequence numbers and acknowledgment numbers to keep track of the conversation, and to advance or move back the conversation as required.
The ACK number is a 32-bit number that acknowledges the reception of all the previous bytes of information. It tells the sender up to which byte the information has been received properly, so the flow of information can proceed smoothly. Basic TCP is a cumulative acknowledgment system, which uses a single number to acknowledge the data received. That number is the sequence number of the next byte the receiver expects, that is, one past the last contiguous byte of the stream successfully received.
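To make the cumulative acknowledgment concrete, here is a minimal Python sketch. The function name cumulative_ack and the (seq, length) tuple layout are illustrative assumptions, not part of any library. It computes the acknowledgment number a receiver would send, which on the wire is the sequence number of the next byte it expects.

```python
SEQ_MOD = 2 ** 32  # sequence and acknowledgment numbers are 32-bit values


def cumulative_ack(initial_seq, received):
    """Return the ACK number for a collection of received (seq, length) segments.

    initial_seq is the sequence number of the first data byte expected.
    The result is the sequence number of the next byte the receiver expects,
    i.e. one past the last byte received without a gap.
    """
    # Work with offsets relative to initial_seq so a wrap past 2**32 - 1
    # does not disturb the ordering (assumes segments are close together).
    segments = sorted(((seq - initial_seq) % SEQ_MOD, length)
                      for seq, length in received)
    next_offset = 0
    for offset, length in segments:
        if offset > next_offset:
            break  # gap found: nothing beyond it can be acknowledged yet
        next_offset = max(next_offset, offset + length)
    return (initial_seq + next_offset) % SEQ_MOD


# Bytes 1000-1199 arrived, 1200-1299 are missing, 1300-1399 arrived.
ack = cumulative_ack(1000, [(1000, 100), (1100, 100), (1300, 100)])
```

In this example ack is 1200: the third segment sits beyond a gap, so the receiver keeps asking for byte 1200 until the hole is filled.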
TCP flags are single bits, each carrying crucial information. The most important TCP flags are:
ACK flag: if the ACK flag is set, the acknowledgment number field is relevant and contains ACK information.
SYN flag: used during session establishment to agree on initial sequence numbers. TCP is connection-oriented, so session setup is required to agree on the sequence numbers, which are random numbers to make attacks and spoofing harder.
FIN flag: used during a graceful session close to show that the sender has no more data to send.
RST flag: the reset flag, an immediate abort (normally an abnormal session disconnection).
PSH flag: the PUSH flag. If this flag is set, the data is pushed (forced) through without waiting for the buffers to fill. In other words, if the PSH flag is set, the sending side transmits the data immediately without waiting for its buffer to fill. It also tells the destination side of the communication that the received data needs to be delivered to the application immediately (no buffering).
In the following picture you can see how the normal flow of sending and receiving information works.
As you can see, there is a buffer involved on both sides. Setting the PSH flag is a way to bypass those buffers. When we use, for example, a Telnet application, the PSH flag is used to ensure that the data is transmitted immediately.
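The header fields and flags described so far can be decoded with a few lines of Python. This is only a sketch: parse_tcp_header is a hypothetical helper name, it reads just the fixed 20-byte part of the header (no options), and the sample bytes are made up for illustration.

```python
import struct

# Bit masks of the most important flags within the 16-bit offset/flags word.
FLAGS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10}


def parse_tcp_header(raw):
    """Decode the fixed 20-byte part of a TCP header into a dict."""
    if len(raw) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    # Network byte order: two 16-bit ports, 32-bit sequence number,
    # 32-bit acknowledgment number, then four 16-bit words.
    (src, dst, seq, ack,
     off_flags, window, _checksum, _urgent) = struct.unpack("!HHIIHHHH", raw[:20])
    return {
        "src_port": src,
        "dst_port": dst,
        "seq": seq,
        "ack": ack,
        "header_len": (off_flags >> 12) * 4,  # data offset is in 32-bit words
        "flags": [name for name, mask in FLAGS.items() if off_flags & mask],
        "window": window,
    }


# Made-up SYN+ACK segment from port 80 to port 49152.
sample = struct.pack("!HHIIHHHH", 80, 49152, 5000, 1001, (5 << 12) | 0x12, 65535, 0, 0)
header = parse_tcp_header(sample)
```

Here header["flags"] contains SYN and ACK, which is what the second step of session establishment looks like.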
Now that we have already covered the basic details of the TCP flags, we can move forward. What else do we have in the TCP header that is relevant for our purpose?
Window Size is a very important 2-byte field in the TCP header. The TCP window size, or as some call it, the TCP receive window size, is simply an advertisement of how much data (in bytes) the receiving device is willing to receive at any point in time. The receiving device can use this value as a flow control mechanism to control the flow of data.
When the window size is 0, it simply means that the computer advertising Window size = 0 is overwhelmed and cannot cope with more data. In other words, the application cannot read information from the buffer fast enough, and the buffer has become full. The communication needs to be temporarily stopped because the receiver needs a break.
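A small simulation may help to see how the window shrinks to zero. This is a simplified sketch under stated assumptions: the function names, the 4000-byte buffer and the event list are all made up, and sequence number wraparound is ignored.

```python
def simulate_receiver(buffer_size, events):
    """Return the window advertised after each event.

    events is a list of ("recv", n) when n bytes arrive from the network,
    or ("read", n) when the application reads n bytes from the buffer.
    """
    buffered = 0
    windows = []
    for kind, n in events:
        if kind == "recv":
            buffered = min(buffer_size, buffered + n)
        elif kind == "read":
            buffered = max(0, buffered - n)
        # The advertised window is the free space left in the buffer.
        windows.append(buffer_size - buffered)
    return windows


def usable_window(advertised_window, last_byte_sent, last_byte_acked):
    """Bytes the sender may still transmit without overrunning the receiver."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, advertised_window - in_flight)


# The application reads more slowly than the data arrives.
windows = simulate_receiver(4000, [("recv", 2000), ("recv", 2000), ("read", 1000)])
```

In this run the advertised window goes 2000, 0, 1000: the second arrival fills the buffer (zero window), and the sender can only resume once the application has read some data.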
Hopefully the basic concepts are now clear. It is time to look at the situations in the transport protocol that can cause problems for the overall communication.
Zero Window: means that the window size of the computer advertising the “zero window” has become 0, and that it therefore cannot cope with the amount of data that the other side of the communication is sending. If this happens, there is a delay, as the sender needs to wait until some room for new data is available at the receiver side before it can continue sending information.
The filter that can be used in Wireshark to detect a zero window condition is tcp.analysis.zero_window
A high number of “zero window” events is a clear indication that the receiver cannot keep up with the sender. Increasing the TCP buffers on the receiver is likely required in those scenarios, or reducing the amount of data being sent from the sender to the receiver.
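If you prefer to count matches from a script rather than in the Wireshark window, the same display filter can be passed to tshark, the command-line companion of Wireshark. The sketch below assumes tshark is installed and on the PATH; the helper names and the file name capture.pcap are hypothetical.

```python
import subprocess


def tshark_filter_command(capture_path, display_filter):
    """Build a tshark command that prints the frame number of every match."""
    return [
        "tshark",
        "-r", capture_path,      # read packets from a capture file
        "-Y", display_filter,    # apply a Wireshark display filter
        "-T", "fields",          # print selected fields only
        "-e", "frame.number",    # one frame number per matching packet
    ]


def count_matches(capture_path, display_filter):
    """Run tshark and count the packets matching the display filter."""
    result = subprocess.run(
        tshark_filter_command(capture_path, display_filter),
        capture_output=True, text=True, check=True,
    )
    return len(result.stdout.split())


# Hypothetical capture file; count_matches() is not called here because
# it needs tshark and a real capture to be present.
command = tshark_filter_command("capture.pcap", "tcp.analysis.zero_window")
```

Calling count_matches("capture.pcap", "tcp.analysis.zero_window") on a real capture would then give the number of zero window advertisements in it.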
Duplicate ACKs, out-of-order packets and retransmissions: duplicate ACKs are normally an indicator of missing packets. When a receiver receives a packet with a higher sequence number than expected, it proceeds as if some data had been dropped. To make the sender aware of the apparently dropped data as quickly as possible, the receiver immediately sends an acknowledgment (ACK) with the acknowledgment number set to the expected sequence number.
That is a dup ack. It is the receiver side of the communication detecting a gap in the sequence numbers and telling the sender: “you are sending me a segment that contains a sequence number higher than the one I am expecting… could you please resend me the one I am expecting?”
So the receiver detects a gap in the sequence numbers, and generates a duplicate ACK (an ACK whose acknowledgment number equals the expected sequence number, which is lower than the latest sequence number received from the sender).
The receiver keeps sending dup acks for each subsequent segment it receives on that connection, until the missing segment is successfully received, either because it arrives naturally by normal means or because it is retransmitted.
Bear in mind that duplicate ACKs can be normal, as they also identify packets that arrive out of order (specifically with a higher sequence number than the one expected), so not all duplicate ACKs end up in a retransmission or indicate an issue.
Sometimes after one or two dup acks, the right TCP segment is received. It just arrives with a delay, but there is no packet loss and no need to retransmit, as TCP is prepared for reordering.
If the sender has “fast retransmission” enabled, once it receives 3 duplicate ACKs it will resend the missing segment. A large number of duplicate ACKs can be a clear indicator of dropped or missing segments, or of packet reordering above a healthy level. The way to determine whether the dup acks end up in a retransmission is to look at the dup ack count, for example dup ack #1, dup ack #2, and so on. If fast retransmission is enabled, then after 3 dup acks are received by the sender, TCP retransmits that segment without waiting for the retransmission timer to expire. This is known as fast retransmit because it happens before the retransmission timer expires naturally.
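The sender side of this logic can be sketched in Python as a simple counter. This is a deliberately simplified model: it treats any repeated acknowledgment number as a dup ack, whereas a real TCP stack also checks other conditions, and the function name and the ACK values are made up.

```python
def fast_retransmit_points(acks, threshold=3):
    """Return (index, ack_number) pairs where fast retransmit would fire.

    acks is the list of acknowledgment numbers seen by the sender, in
    arrival order. A retransmission fires on the threshold-th duplicate.
    """
    triggers = []
    last_ack = None
    dup_count = 0
    for index, ack in enumerate(acks):
        if ack == last_ack:
            dup_count += 1  # dup ack #1, #2, #3, ...
            if dup_count == threshold:
                triggers.append((index, ack))
        else:
            last_ack = ack  # new data acknowledged, start counting again
            dup_count = 0
    return triggers


# The receiver keeps asking for byte 2000, then the gap is filled.
lossy = fast_retransmit_points([1000, 2000, 2000, 2000, 2000, 5000])
# Only one dup ack: simple reordering, no retransmission needed.
reordered = fast_retransmit_points([1000, 2000, 2000, 3000])
```

In the first stream the third dup ack for 2000 arrives at index 4, so the sender would resend the segment starting at byte 2000 at that point. The second stream never reaches three dup acks, so nothing is retransmitted.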
A high dup ack count, like dup ack #20, is normally a sign of issues: it suggests that fast retransmission is not enabled, or that many out-of-order segments are being received.
You can filter duplicated acks in Wireshark using the filter: tcp.analysis.duplicate_ack
Additionally you can spot retransmissions in Wireshark with the following filters:
tcp.analysis.fast_retransmission → retransmission triggered by the reception of duplicate ACKs (fast retransmission).
tcp.analysis.retransmission → retransmission triggered by the expiration of the retransmission timer.
As a rule of thumb, the retransmission rate of the traffic should not exceed 2%. If the rate is higher, the user experience of your service may be affected.
How can we tell whether we are receiving too many out-of-order packets or whether it is “normal”?
The TCP protocol was designed to deal with out-of-order packets, but because TCP only passes data up to the application once all the received bytes are in order, many out-of-order packets, or out-of-order packets that arrive with a high delay, will degrade performance.
That is why there is a Wireshark filter to identify out of order packets:
tcp.analysis.out_of_order
Wireshark marks a packet as out of order when (a) it contains data, (b) it does not advance the sequence number, meaning that it has arrived after another packet that had a higher sequence number, and (c) it arrives within 3 ms of the segment with the highest sequence number seen.
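The three criteria above can be sketched in Python as follows. This is a simplified reading of those criteria, not Wireshark's actual code: the function name, the tuple layout and the sample timings are made up, and labelling late arrivals as "retransmission" is an assumption of this sketch.

```python
def classify_segments(segments, threshold_s=0.003):
    """Label each (arrival_time_s, seq, length) segment of one direction."""
    labels = []
    next_expected = None      # highest seq + length seen so far
    last_advance_time = None  # arrival time of the segment that set it
    for arrival, seq, length in segments:
        behind = next_expected is not None and seq < next_expected
        if length > 0 and behind:
            # (a) carries data and (b) does not advance the sequence number.
            if arrival - last_advance_time <= threshold_s:
                labels.append("out-of-order")    # (c) arrived within 3 ms
            else:
                labels.append("retransmission")  # too late to be reordering
        else:
            labels.append("normal")
        end = seq + length
        if next_expected is None or end > next_expected:
            next_expected = end
            last_advance_time = arrival
    return labels


labels = classify_segments([
    (0.000, 1000, 100),  # first segment
    (0.001, 1200, 100),  # bytes 1100-1199 were skipped
    (0.002, 1100, 100),  # the skipped bytes arrive 1 ms later
    (0.500, 1100, 100),  # the same bytes again, half a second later
])
```

The third segment is labelled out-of-order because it fills the gap almost immediately, while the fourth one arrives far too late to be explained by reordering.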
On average, around 3% of packets are out of order (this can be considered normal). If more than 3% of packets are out of order, it can cause a performance issue and requires further investigation.
CONCLUSIONS:
In this article I wanted to highlight the most common issues at the transport layer:
Zero Window being advertised by the receiver.
Retransmissions exceeding 2% of the total amount of TCP segments.
Out-of-order TCP segments exceeding 3%.
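As a recap, the three checks can be combined into a single Python function. This is only a sketch: the function name and the sample counts are made up, the 2% and 3% limits are the rule-of-thumb values used in this article, and the counts themselves would come from the Wireshark filters shown earlier.

```python
def transport_health(total_segments, zero_windows, retransmissions, out_of_order,
                     retrans_limit=0.02, ooo_limit=0.03):
    """Return a list of findings for one capture; an empty list means healthy."""
    if total_segments <= 0:
        raise ValueError("total_segments must be positive")
    findings = []
    if zero_windows > 0:
        findings.append("zero window advertised by the receiver")
    if retransmissions / total_segments > retrans_limit:
        findings.append("retransmissions above 2%")
    if out_of_order / total_segments > ooo_limit:
        findings.append("out-of-order segments above 3%")
    return findings


healthy = transport_health(1000, zero_windows=0, retransmissions=10, out_of_order=10)
troubled = transport_health(1000, zero_windows=2, retransmissions=50, out_of_order=40)
```

The first capture (1% retransmissions, 1% out of order) produces no findings, while the second one trips all three checks.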
I hope you have enjoyed the content of this article, and that you have learnt something useful while reading it.