A Technique for real-time Network Traffic Classification

Hello World!

In this article, I am going to explain a technique to classify network traffic into flows in real time. This is useful for network monitoring systems which require on-the-fly processing of incoming and outgoing traffic. Such real-time classification of traffic into flows facilitates further processing and analysis by plug-ins which can detect security attacks, for instance.

This classification is done at the transport layer i.e. this system classifies network traffic based on socket layer connections (e.g. 192.168.0.1:80 ← → 192.168.25.157:4585). This technique maintains separate trees for each transport layer protocol e.g. TCP and UDP.

On high level, this tree-based classification is very similar to decision-trees. The idea is, whenever we come across a new packet, we start from the root of the tree and traverse that packet down the height of the tree. At each level, the decision of which child node to choose is made based on certain parameter from packet’s header (for example, port number). After a fixed number of levels, the packet reaches its leaf node. We repeat the process of traversing the tree for every captured packet (This is quite fast considering the running time, and as you will see later, the expense is worth it). The most important idea is, ultimately, all the packets belonging to the same flow end up on the same leaf node.

Now let’s zoom in into the internals of this tree.

All the packets corresponding to same source end up on the same leaf node (192.168.25.5:41440 in this case)

For each tree, the first level of internal nodes (depth 1) indicate the port number and the next 4 levels indicate 4 octets of IP address. If we come across a packet with source as 192.168.25.5:41440, it will be classified as follows — from the root node, it will go to first level node corresponding to 41440 port number, from there, it will go to child node corresponding to “192” as first IP octet. Then, to the child node corresponding to “168” as second octet and so on. It will ultimately reach the leaf node from where any desired operation can be performed.

Note that this is a dynamically generated tree. The nodes are created as and when required. For example, the path TCP →41440 →192 →10 →30 →40 is only created while traversal of first packet which belongs to source 192.10.30.40:41440. With the passage of time, this tree is bound to get mammoth as new IP addresses and port numbers keep coming into picture. To overcome this, you can set a pruning interval of, say, 1 hour. Such that after each 1 hour, the grown tree till that moment will be discarded and we’ll start with new root node with no child.

Now comes the question, how can you leverage getting all the packets of a flow on the same leaf node. You can easily create daisy links between packets belonging to the same flow in the packet dump. This can be done as follows — whenever you encounter packet 1 of a flow, write it to the packet dump file along with an additional blank field (call it ‘next’), to store the offset value of the next packet which will come (packet 2). Note down the memory offset value of packet 1 in dump file on the leaf node which it had ended up. When packet 2 of the same flow comes, write it to the dump file. Note the memory offset where it got written and update this on the ‘next’ field of packet 1. You can jump to packet 1 by reading the offset which we have stored on the leaf node. Now, update the leaf node with memory offset of packet 2. Similarly for packet 3, dump it to the file, update the ‘next’ field of packet 2 to point to its own offset, and update the offset value on leaf node. Think of it this way: each packet after reaching the leaf node, gets the location of its previous packet in dump, updates the ‘next’ pointer of previous packet to its own location in dump (where it will get written to) and after actually getting written to dump, updates its location on the leaf node such that next packet can find it.

Me, along with my colleagues, have implemented the above technique in my undergrad project “nMASE: A Search Engine for Network Trace”.

This was my first medium article and hope you enjoyed it. Feel free to leave questions or comments below :)