“NetFlicks”: visualizing network events

Ankit Singla
Dec 28, 2021 · 6 min read


This is part of a series of write-ups I’m working on, drawing on my notes and materials from roughly the last seven years. I’m leaving academia for an exciting industry opportunity after not getting tenure. The research ideas I’m putting out are open for anyone to build on. The caveat is that my background work on many of these, including a full review of the relevant literature, is incomplete; this is just a snapshot of where I happened to be in my thinking along these lines.

We are still, in 2021, often using the same tools for understanding and debugging networks as we did 30+ years ago, e.g., tcpdumps and traceroutes. Meanwhile, networks have become orders of magnitude larger, more complex, and harder to understand and reason about.

While there is a rich line of work on making networks more robust, e.g., most recently with network verification and synthesis, and on automated methods of summarizing network logs, I haven’t seen much in the way of tools that aid intuition and understanding by visualizing network events and phenomena.

For instance, instead of examining a tcpdump using Wireshark or (even more painstakingly) the line-by-line text event log, why don’t we have something that shows a movie-like view of the packet sequence, highlighting important or anomalous events?

The simplest use of this is in networking education, whereby you can show students various events in a packet stream and what parts of the transport algorithm’s state machine they correspond to. For the simplest scenarios, one could imagine a side-by-side view of the network bottleneck’s evolution in terms of buffer state, together with how those events trigger reactions from congestion control logic. (I’m aware of nice animations of the behavior of two simple transport algorithms, but nothing that uses network logs and data in the way I’m describing here.)
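As a rough sketch of what the data behind such a side-by-side view could look like, here is a toy, self-contained Python example: an AIMD-style sender feeding a single bottleneck queue, with the buffer occupancy and the congestion window plotted next to each other. The constants and the simulation itself are invented purely for illustration; the real version would of course be driven by an actual packet trace rather than a toy loop.

```python
# Toy sketch: an AIMD sender and a single bottleneck queue, plotted side by side.
# All constants are made up; a real tool would replay an actual trace instead.
import matplotlib.pyplot as plt

LINK_RATE = 10      # packets the bottleneck drains per time step (invented)
BUFFER_SIZE = 50    # packets the bottleneck can queue before dropping (invented)
STEPS = 300

cwnd, queue = 1.0, 0.0
cwnd_hist, queue_hist, loss_steps = [], [], []

for t in range(STEPS):
    queue += cwnd                        # sender injects a window's worth of packets
    if queue > BUFFER_SIZE:              # buffer overflow => packet loss
        queue = BUFFER_SIZE
        loss_steps.append(t)
        cwnd = max(1.0, cwnd / 2)        # multiplicative decrease on loss
    else:
        cwnd += 1.0                      # additive increase per "RTT"
    queue = max(0.0, queue - LINK_RATE)  # bottleneck drains the queue
    cwnd_hist.append(cwnd)
    queue_hist.append(queue)

fig, (ax_buf, ax_cwnd) = plt.subplots(1, 2, figsize=(10, 3), sharex=True)
ax_buf.plot(queue_hist)
ax_buf.set_title("Bottleneck buffer occupancy")
ax_cwnd.plot(cwnd_hist)
for t in loss_steps:                     # mark the loss-triggered reactions
    ax_cwnd.axvline(t, linestyle=":", alpha=0.4)
ax_cwnd.set_title("Congestion window (AIMD reactions)")
plt.tight_layout()
plt.show()
```

Marking the loss instants on the congestion-window panel is the crude analogue of highlighting which part of the transport state machine fired at each point.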

For more complex use cases, such as the debugging of a complex networked system, it would be great to have a workflow where the following occurs:

  • From a network data log or a set of logs, an automated process identifies interesting events. These could be flagged in several ways, e.g., via anomaly detection or via triggers encoded by the network operator, such as “loss exceeds x%”.
  • A second step would order the interesting events and identify potential causality among them. This can use a variety of approaches customized to the context in question; for TCP logs, one could imagine encoding rules like “a decrease in the receiver window could cause a decrease in transmission rate”, “packet loss could cause a decrease in transmission rate”, “packet loss may be preceded by a period of higher RTTs”, etc. (A toy sketch of these first two steps follows this list.)
  • Finally, and this is where I think prior work is lacking most, a third step visualizes the potential chain(s) of causal events for the network operator as a candidate analysis of the situation. The visualizations can involve several components: the system itself, e.g., an appropriate abstraction of a data center topology showing where loss is occurring and which flows are contributing; the key parts of system state involved, e.g., the buffer(s) at particular switches; and the rules/logic of the algorithmic process triggered in response, e.g., congestion avoidance mechanisms; and so on.
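To make the first two steps a bit more concrete, here is a minimal Python sketch of what trigger-based event detection and rule-based causal ordering might look like. Everything in it is hypothetical: the log-record fields (time, loss_pct, rtt_ms, and so on), the thresholds, and the causal rules are placeholders I made up for illustration, not any real log format or tool.

```python
# Hypothetical sketch of steps 1 and 2 over a list of TCP-style log records.
# All field names, thresholds, and rules below are invented for illustration.
from dataclasses import dataclass

@dataclass
class Event:
    time: float
    kind: str

# Step 1: operator-encoded triggers over log records (plain dicts here).
TRIGGERS = {
    "high_loss":   lambda r: r["loss_pct"] > 1.0,
    "rtt_spike":   lambda r: r["rtt_ms"] > 1.5 * r["base_rtt_ms"],
    "rate_drop":   lambda r: r["rate_mbps"] < 0.5 * r["prev_rate_mbps"],
    "rwnd_shrink": lambda r: r["rwnd_kb"] < 0.5 * r["prev_rwnd_kb"],
}

def detect_events(records):
    """Flag every record that fires one of the operator-encoded triggers."""
    return [Event(r["time"], kind)
            for r in records
            for kind, trigger in TRIGGERS.items() if trigger(r)]

# Step 2: context-specific causal rules, each as (cause kind, effect kind,
# max lag in seconds within which we call the pair "potentially causal").
CAUSAL_RULES = [
    ("rwnd_shrink", "rate_drop", 1.0),   # smaller receiver window -> lower rate
    ("high_loss",   "rate_drop", 1.0),   # loss -> lower rate
    ("rtt_spike",   "high_loss", 5.0),   # higher RTTs often precede loss
]

def candidate_causes(events):
    """Return (cause, effect) event pairs consistent with the rules above."""
    events = sorted(events, key=lambda e: e.time)
    links = []
    for cause_kind, effect_kind, max_lag in CAUSAL_RULES:
        for c in (e for e in events if e.kind == cause_kind):
            for e in events:
                if e.kind == effect_kind and 0 < e.time - c.time <= max_lag:
                    links.append((c, e))
    return links   # step 3 would render these candidate chains as a "scene"
```

A real system would swap in its own log schema and operator-supplied triggers, and presumably something more principled than fixed thresholds on the anomaly-detection side; the point is only that the first two steps can start out embarrassingly simple.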

Each of the three steps mentioned above is far from trivial, but the thing is: if this “NetFlicks” system can build a one-minute movie showing the operator a plausible evolution of their system, it doesn’t have to always get it right. If it gets it wrong, the operator can always go back to whatever methods they’re already reliant on. But if NetFlicks gets it right, the operator might learn the needed insights in a minute rather than hours or days of painstaking analysis. The vast gap between the usefulness when it’s correct and the annoyance when it’s mistaken seems to set a low bar for the first tools along these lines. How cool would it be to have a plausible diagnosis of a situation that can be understood in a minute, even if it’s only right (or nearly right), say, 20–30% of the time?

Of course, such a tool will need to be customized to every context, with different tooling for a satellite network than for a data center network, and different tooling to debug routing vs. transport issues, with things that involve both being even more complex.

To make the above discussion more concrete, consider a hypothetical NetFlicks extension of the Hypatia framework for visualizing satellite networks, which might find the following set of events and causes:

  • [Anomalous event] Several connections are experiencing very low throughput, even though the links they use are under-utilized.
  • [Causality-A] The delay-based congestion control logic has throttled the sending rate.
  • [Causality-B] Queuing delays are fluctuating.
  • [Causality-C] The end-end path sometimes elongates due to satellite motion, increasing the delay even though there’s no congestion (a toy sketch of this effect follows the list).
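As the toy sketch promised above, here is a tiny Python snippet of a hypothetical delay-based rate controller that backs off whenever the RTT rises well above the minimum it has seen, fed with an RTT series whose increase comes purely from path elongation rather than queuing. The controller, its 1.2x threshold, and the numbers are all invented; it is only meant to show how C can end up producing A, not to reproduce any particular real algorithm.

```python
# Hypothetical delay-based rate controller misreading path elongation as congestion.
# All thresholds and numbers are invented for illustration.

def delay_based_rate(rtt_samples_ms, start_rate_mbps=50.0):
    rate = start_rate_mbps
    min_rtt = rtt_samples_ms[0]
    rates = []
    for rtt in rtt_samples_ms:
        min_rtt = min(min_rtt, rtt)
        if rtt > 1.2 * min_rtt:     # "delay is rising, so there must be congestion"
            rate *= 0.9             # throttle, even though the links are idle
        else:
            rate = min(start_rate_mbps, rate * 1.05)
        rates.append(rate)
    return rates

# RTT steps from 40 ms to 60 ms purely because the LEO path gets longer (cause C).
rtts = [40.0] * 50 + [60.0] * 50
print(delay_based_rate(rtts)[-1])   # the rate has collapsed despite zero congestion
```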

Now, across a set of traces, it should be possible to identify that C leads to B, which leads to A, which leads to the low-throughput event, and to display the following in a “scene”:

  • An example path change, highlighting the elongation of the path, much like the change shown in the graphic below, but continuous rather than looped.
  • Side-by-side, the corresponding throughput evolution over time.
  • Side-by-side, the part of the congestion control logic that is triggered at the point(s) throughput drops or increases substantially.

I was lazy here, but I could very well have put together the desired movie for such a scenario manually using Hypatia if I wanted to. But the goal of a project along these lines should be to help uncover and visualize such insights in as automated a fashion as possible. Of course, a variety of pieces will need to be provided manually up front, e.g., ways to visualize the system at any point in time, encodings of which types of events could be interesting, which metrics’ temporal evolution is worth monitoring, etc.

A GIF showing a path over an LEO satellite network changing in length. The path elongation (worth several tens of seconds here) can cause delay changes that delay-based congestion control mistakes for congestion, throttling throughput.

Ideally, NetFlicks runs through “normal” operation speedily, and slows down to highlight the parts crucial to understanding what’s going wrong during the throughput-deterioration event (a toy sketch of this speed mapping follows the clip below). Sort of like the slow-motion enhancements used in action movies, and the adorable cat action below (which I most definitely did not include only to capture your attention):

From “The Slow Mo Guys” on YouTube.
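A minimal sketch of the variable-speed playback idea, again with everything invented: map a per-timestep “interest” score (e.g., produced by the step-1 triggers above) to a per-frame display duration, so that quiet stretches flash by and the anomalous window lingers on screen.

```python
# Hypothetical mapping from per-timestep interest scores to frame durations.
# The score range, durations, and example values are all made up.

def frame_durations(interest_scores, fast_ms=20, slow_ms=500):
    """Interpolate between a fast and a slow frame duration by interest."""
    durations = []
    for s in interest_scores:
        s = max(0.0, min(1.0, s))               # clamp scores to [0, 1]
        durations.append(fast_ms + s * (slow_ms - fast_ms))
    return durations

# Mostly "normal" operation, with a brief anomaly around which playback slows.
scores = [0.0] * 20 + [0.9, 1.0, 1.0, 0.8] + [0.0] * 20
print(frame_durations(scores)[18:26])
```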

The same type of visual reasoning, with different base visualizations of the system and different events and control logic encoded, should be applicable for a variety of networked contexts, e.g., for a data center environment, or a Google/Microsoft-style inter-datacenter WAN.

In my view, this is a reasonably concrete project vision that could produce HotNets-like work (showing very basic functionality) within 4-ish months of work, with a healthy potential for leading to a series of papers and software as part of a solid dissertation, and (very optimistically) even the foundation of a startup around this type of visual network intelligence that complements tools sophisticated network operators use.

I am looking forward to finding out (hopefully soon) how some of the most complex and best-managed networks today are handling network debugging and building understanding and intuition, but meanwhile, if anyone is willing to share anything on the state of the art along the lines I’ve sketched, please do. I’m reasonably sure there are reams of related literature I could have cited that provide various pieces useful for building NetFlicks, and if you want to leave pointers in comments, that’s more than welcome. If I find the time and inclination, I can try to update this writeup with the most salient pointers.
