TCP: Let the flowlets flow!

ajay kulkarni
10 min read · Mar 5, 2024


Before you read this part, please glance over Part I: https://medium.com/@ajaykul/data-center-networking-part-i-511b6cfcc8fe

Transmission Control Protocol (TCP) primer

On your system, open a terminal (on Windows, type "cmd" in the Run dialog to get a command prompt) and type "netstat -p TCP":

kulkarnia@kulkarnia Downloads % netstat -p TCP
Active Internet connections
Proto Recv-Q Send-Q Local Address Foreign Address (state)
tcp4 0 0 kulkarnia-mbp.jn.55171 eng-eye-p1-qnc-e.https ESTABLISHED
tcp4 0 0 kulkarnia-mbp.jn.55170 104.208.16.91.https ESTABLISHED
tcp4 0 0 kulkarnia-mbp.jn.55168 ec2-18-204-85-51.https ESTABLISHED
tcp4 0 0 kulkarnia-mbp.jn.55153 sfo03s27-in-f2.1.https ESTABLISHED
tcp4 0 0 kulkarnia-mbp.jn.55152 sfo03s27-in-f2.1.https ESTABLISHED
tcp4 0 0 kulkarnia-mbp.jn.55151 sfo03s25-in-f1.1.https ESTABLISHED
tcp4 0 0 kulkarnia-mbp.jn.55148 172.217.164.99.https ESTABLISHED
tcp4 0 0 kulkarnia-mbp.jn.55147 sfo03s26-in-f1.1.https ESTABLISHED
tcp4 0 0 kulkarnia-mbp.jn.55145 sfo03s32-in-f2.1.https ESTABLISHED
tcp4 0 0 kulkarnia-mbp.jn.55141 nuq04s39-in-f14..https ESTABLISHED
tcp4 0 0 kulkarnia-mbp.jn.55140 172.64.155.119.https ESTABLISHED
tcp4 0 0 kulkarnia-mbp.jn.55138 nuq04s42-in-f10..https ESTABLISHED
tcp4 0 0 kulkarnia-mbp.jn.55137 104.18.131.236.https ESTABLISHED
tcp4 0 0 kulkarnia-mbp.jn.55136 sfo03s32-in-f8.1.https ESTABLISHED

tcp4 0 0 kulkarnia-mbp.jn.55091 20.40.202.2.https CLOSE_WAIT

This lists all the TCP connections made from your system (Local Address) to a remote address (Foreign Address). But what is TCP, after all?

For communication, a computer system typically uses a 7-layer model, starting from the Physical layer at the bottom (the physical cable, carrying raw bits) up to the Application layer at the top. This 7-layer model is a standard that devices and providers across the industry rely on.

TCP sits at layer 4 in this 7-layer stack, on top of the IP layer (layer 3). An IP address (for example, 10.2.3.4) identifies a device; each device has a unique IP. The IP layer is the addressing system of the internet, and its core function is to route a packet from one device to another. Think of laying out the roads and pathways: that is what the IP layer does by identifying devices and routing between them.

IP is a connectionless protocol, meaning each packet is stamped with the destination address and routed individually, even though it belongs to a batch of other packets. When a packet reaches the destination, no acknowledgement is sent by the destination back to the source, so the source has no way of knowing whether the packet made it to its intended destination.

This is where TCP comes in. TCP provides an acknowledgement scheme whereby each packet received is acknowledged back to the source. TCP also provides flow control and congestion control. Thus TCP adds reliability to network traffic on top of the best-effort IP layer.

Now consider that each device might run multiple applications, and when you send your data you need to make sure the right application receives it. This is what port numbers are for. Well-established applications use fixed port numbers; for instance, HTTP, the protocol used by browsers, uses port 80.

In short, the two IP addresses of a connection, together with the two ports and the protocol, uniquely identify it. This is called the 5-tuple: source IP, source port, destination IP, destination port, and protocol. This is shown in the following picture, obtained from a Wireshark capture of traffic flowing through my device; the red arrows mark the 5-tuple fields in the headers. This 5-tuple will come in use later.
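To make the 5-tuple concrete, here is a minimal Python sketch that opens a TCP connection and prints the resulting 5-tuple; the hostname and port are placeholders, not anything specific to this article:

import socket

# Open a TCP connection to a web server (443 = HTTPS).
sock = socket.create_connection(("example.com", 443))

src_ip, src_port = sock.getsockname()  # local end, chosen by the OS
dst_ip, dst_port = sock.getpeername()  # remote end
protocol = 6                           # TCP's protocol number in the IP header

print((src_ip, src_port, dst_ip, dst_port, protocol))
sock.close()

Run it a few times and you will see the source port change while everything else stays fixed: the OS picks a fresh ephemeral port for each connection, which is exactly why the port belongs in the tuple.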

TCP flow control

Let us also discuss, in brief, the TCP flow control mechanism. As soon as a TCP connection is made, both the receiver and the sender create buffers to receive and transmit. In every acknowledgement, each side advertises the amount of buffer space left on its end. This ensures that neither side sends data that would overwhelm the other. This mechanism is called the sliding window protocol.
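As a toy illustration, the following Python sketch caps the sender's unacknowledged bytes at the receiver's advertised window. It is a simplification of real TCP (where the window rides in segment headers), and all sizes are made-up numbers:

def send_with_flow_control(data, advertised_window, mss=1460):
    """Yield (seq, segment) pairs, pausing when the window is full."""
    next_seq = 0    # next byte to send
    last_acked = 0  # highest byte acknowledged so far
    while last_acked < len(data):
        if next_seq - last_acked < advertised_window and next_seq < len(data):
            # Window has room: send the next segment.
            segment = data[next_seq:next_seq + mss]
            yield (next_seq, segment)
            next_seq += len(segment)
        else:
            # Window full (or nothing left to send): wait for an ACK.
            # Here we simply simulate the receiver acknowledging one segment.
            last_acked = min(last_acked + mss, next_seq)

for seq, seg in send_with_flow_control(b"x" * 8000, advertised_window=4380):
    print(f"sent seq={seq} len={len(seg)}")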

TCP Congestion control

The medium, or the path on which packets are sent, is shared by many senders transmitting at the same time. This is analogous to a highway shared by numerous cars: during rush hour, a buildup and a huge backlog form. However, unlike the highway system, once a buffer limit is reached the packets are simply dropped without any notification. The sender waits for the acknowledgement and, after a while, retransmits the packet. As more and more senders do this, congestion builds and more packets get dropped, causing the senders to pump in even more traffic at exactly the time they should stop transmitting and let the network heal. This is what motivates the congestion control mechanism.

Congestion control has three phases: start slow, try to avoid congestion, and keep a lookout for loss (the detection phase). Every acknowledgement marks a packet completing a round trip, from sender to receiver and back to the sender; the measured duration is called the round-trip time (RTT). The Congestion Window (CWND) is the amount of bytes that can be in flight at a given time. During slow start, CWND is set to a small size and grows by one segment per acknowledgement, which doubles it every RTT, until it reaches a threshold. Past the threshold, CWND is increased very slowly (additive increase) so as not to cause congestion in the first place.

During this buildup of CWND, a packet loss can occur. A loss is indicated by no acknowledgement arriving within some time (a multiple of the observed RTT). When a loss occurs, CWND is cut in half (multiplicative decrease) and the slow buildup starts again.
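The growth-and-backoff pattern is easy to see in a toy simulation. The numbers below (threshold, loss points) are made up for illustration and do not come from any real TCP stack:

def simulate_cwnd(rtts=30, ssthresh=64.0, losses=(12, 25)):
    cwnd = 1.0  # congestion window, in segments
    for rtt in range(rtts):
        print(f"RTT {rtt:2d}: cwnd = {cwnd:6.1f} segments")
        if rtt in losses:
            ssthresh = cwnd / 2          # remember where trouble started
            cwnd = max(1.0, cwnd / 2)    # multiplicative decrease
        elif cwnd < ssthresh:
            cwnd *= 2                    # slow start: doubles every RTT
        else:
            cwnd += 1                    # congestion avoidance: additive increase

simulate_cwnd()

The printed sawtooth (ramp up, halve on loss, ramp up again) is the classic additive-increase/multiplicative-decrease (AIMD) shape.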

This was the TCP primer. There is a lot more to cover, including the TCP state machine, but that's for some other day. If you are with me till this point, just remember: TCP provides reliability using acknowledgements, plus flow control and congestion control mechanisms.

Data Centers and TCP

In the first part (https://medium.com/@ajaykul/data-center-networking-part-i-511b6cfcc8fe) we saw the data center architecture and how multiple paths exist from a source to a destination. For the sake of uninterrupted discussion, I am pasting the architecture again here:

The existence of multiple paths allows us to split traffic across them. Splitting traffic leads to better network utilization, improved reliability, and increased throughput. However, sending traffic across multiple links (multipath) also causes reordering: packets arrive at the destination in a different order than they were sent. For example, packet #1 may arrive after packet #4 if packet #4 was routed on a faster path. The destination then has to put the packets back in order before presenting them to the upper layer, so acknowledgements may go out much later than expected. This can cause unnecessary retransmits and can trigger congestion-avoidance mechanisms, which decreases throughput.
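A tiny made-up simulation shows the effect: spray one flow across two paths with different latencies, and later packets on the fast path overtake earlier packets on the slow path. The delays and send times below are arbitrary:

PATH_LATENCY_US = {"A": 100, "B": 40}  # one-way delay per path, microseconds

arrivals = []
for seq in range(1, 7):
    path = "A" if seq % 2 == 1 else "B"  # alternate packets over 2 paths
    send_time = seq * 10                 # packets leave 10 us apart
    arrivals.append((send_time + PATH_LATENCY_US[path], seq, path))

# Sort by arrival time to see the order the receiver observes.
for t, seq, path in sorted(arrivals):
    print(f"t={t:3d}us  packet #{seq} arrives via path {path}")

With these numbers the receiver sees #2, #4, and #6 before #1 ever shows up: exactly the reordering described above.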

That said, we currently still use multiple paths to get benefits from the available routes. We do so using any of the following schemes (minimal sketches of all three follow the list):

a. Round-Robin Scheduling: Packets are forwarded over the available paths in turn. So if you have 4 paths A, B, C, D, you send the first packet on path A, the second on B, the third on C, and so on. The advantage is its simplicity; the disadvantage is that packets of the same flow end up on different paths, causing performance variance.

b. Weighted Round-Robin: Here you assign weights to paths. The higher the weight, the more likely a packet is assigned to that path. The question then becomes how to decide these weights, and given the dynamic nature of the network, this might still not solve the performance variation.

c. Hashing: We saw earlier the 5-tuple: source IP, source port, destination IP, destination port, and protocol. Hashing takes the 5-tuple values carried in the packet, computes a hash over them, and uses that hash to map the packet to a path. So if the 5-tuple is {10.0.0.1, 5223, 20.0.0.1, 80, 6}, we can concatenate these five values into one string and feed it to a hash algorithm, which outputs a single integer. The algorithm is usually a well-known one such as MD5, whose output is a 128-bit value; the larger the output space, the fewer the collisions, but let us keep that for a future discussion. That integer is then mapped onto the available paths, say by taking it modulo the number of paths, and a path, say B, is selected. A flow is defined as one source sending packets to a destination. Since the 5-tuple is the same for every packet of a flow, all its packets generate the same hash and the same path is selected for the entire flow. This is a huge advantage, as the performance variation within a flow is reduced. This is the most common way of doing things in a data center.
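Here are the sketches promised above: toy Python versions of all three schemes. The paths, weights, and 5-tuple are made up, and real switches use hardware hash functions (often CRC-based) rather than MD5 from Python's hashlib:

import hashlib
import itertools
import random

PATHS = ["A", "B", "C", "D"]

# a. Round-robin: cycle through the paths, one packet at a time.
rr = itertools.cycle(PATHS)
def round_robin():
    return next(rr)

# b. Weighted round-robin (randomized variant): pick paths in
# proportion to their weights.
WEIGHTS = [4, 3, 2, 1]
def weighted():
    return random.choices(PATHS, weights=WEIGHTS, k=1)[0]

# c. Hashing: hash the 5-tuple so every packet of a flow shares a path.
def hashed(five_tuple):
    key = "|".join(str(f) for f in five_tuple).encode()
    digest = hashlib.md5(key).digest()  # 128-bit output
    return PATHS[int.from_bytes(digest, "big") % len(PATHS)]

flow = ("10.0.0.1", 5223, "20.0.0.1", 80, 6)
print([round_robin() for _ in range(5)])  # spreads one flow over paths
print([hashed(flow) for _ in range(5)])   # same path every time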

In data centers, things are pretty symmetrical: there is the same number of hops between source and destination along every path. Hash-based splitting over such paths is called equal-cost multipath routing (ECMP).

Flowlets

Consider a TCP flow between sender A and receiver B. A flowlet is defined as a burst of packets with short inter-packet gaps between themselves, separated from the previous packets of the same flow by a long time gap.

The following image [1] identifies the flowlets in a TCP flow:

Flowlets in a TCP flow

Flowlets arise in a TCP flow for multiple reasons. As discussed earlier, TCP uses flow control and congestion control to manage the flow. In typical operation, a TCP sender sends a bunch of packets and then waits for ACKs to arrive before resuming. This wait can create a gap long enough to form a flowlet boundary. Such gaps in communication are very common, as hosts also spend time generating payloads and processing the ones they have received.

The next question is: how long a gap makes the next packet a new flowlet? This threshold is called the flowlet timeout. A timer is started when a packet is transmitted; if the timer reaches the timeout before the next packet of the flow, that next packet is considered part of a new flowlet. If a transmission happens before the timer expires, the timer is restarted.

There are certain restrictions on the flowlet timeout value. It must be large enough to exceed the latency difference among all the available paths. A short timeout would mean each packet is treated as its own flowlet, which we do not want: we want a flowlet to be a batch of packets that we send, as a unit, on a single chosen path. The way to do this is to include the flowlet id along with the 5-tuple in our hashing. If we ensure the flowlet timeout is greater than the delay difference across all paths, then we can be sure there won't be any out-of-order delivery between packets, even when consecutive flowlets take different paths. This flowlet timeout value is critical to any flowlet-based load balancing scheme.
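Putting the pieces together, here is a toy Python sketch of timeout-based flowlet switching. The packet timings are invented, and the 500-microsecond timeout is borrowed from CONGA [2]:

import hashlib

PATHS = ["A", "B", "C", "D"]
FLOWLET_TIMEOUT = 500e-6  # seconds; CONGA's value [2]

last_seen = {}   # flow 5-tuple -> time of its last packet
flowlet_id = {}  # flow 5-tuple -> current flowlet number

def pick_path(five_tuple, now):
    # A gap longer than the timeout starts a new flowlet.
    if now - last_seen.get(five_tuple, float("-inf")) > FLOWLET_TIMEOUT:
        flowlet_id[five_tuple] = flowlet_id.get(five_tuple, -1) + 1
    last_seen[five_tuple] = now
    # Hash the flowlet id together with the 5-tuple, so each flowlet
    # can land on a different path while its packets stay together.
    key = f"{five_tuple}|{flowlet_id[five_tuple]}".encode()
    return PATHS[int.from_bytes(hashlib.md5(key).digest(), "big") % len(PATHS)]

flow = ("10.0.0.1", 5223, "20.0.0.1", 80, 6)
for t in (0.0, 100e-6, 200e-6, 2000e-6, 2100e-6):  # long gap at t=2000us
    print(f"t={t * 1e6:6.0f}us -> path {pick_path(flow, t)}")

The first three packets share one path; the 1800-microsecond gap before the fourth starts flowlet #1, which may be hashed onto a different path without risking reordering.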

Most per-flowlet balancing schemes today use a fixed flowlet timeout (typically 50–100 microseconds). This empirically configured value can cause issues in practice: if a flow is bursty enough, flowlet switching degenerates into per-packet ECMP and each packet might end up on a different path.

The following results were obtained from CONGA [2], which uses flowlets, comparing its performance with ECMP and MPTCP. The results come from real hardware testbeds; CONGA and CONGA-flow differ in their choice of flowlet timeout: CONGA uses 500 microseconds whereas CONGA-flow uses 10 milliseconds. FCT is the flow completion time; lower is better.

Future

This area of work is being actively researched by various groups and offers a way to improve network performance. However, we have assumed all traffic to be typical Ethernet traffic. What if the traffic profile changes? What about RDMA traffic? How would this kind of network load balancing work in high-performance computing and data center networks? Such traffic flow characteristics need to be considered before settling on the right load balancing algorithms.

Those are the questions we can consider in upcoming write-ups. Thank you for staying with me till this point. Questions and comments are welcome!

References

[1] Guo, Z.; Dong, X.; Chen, S.; Zhou, X.; Li, K. EasyLB: Adaptive Load Balancing Based on Flowlet Switching for Wireless Sensor Networks. Sensors 2018, 18, 3060. https://doi.org/10.3390/s18093060

[2] Alizadeh, M.; Edsall, T.; Dharmapurikar, S.; Vaidyanathan, R.; Chu, K.; Fingerhut, A.; Lam, V. T.; Matus, F.; Pan, R.; Yadav, N.; Varghese, G. CONGA: Distributed Congestion-Aware Load Balancing for Datacenters. In Proceedings of the 2014 ACM Conference on SIGCOMM (SIGCOMM '14), pages 503–514, New York, NY, USA, 2014. ACM.


ajay kulkarni

Working in the field of Networking and Network Security. Have a PhD in network protocols/performance.