Visualising network traffic in a legacy system

In one of our recent stories we described how cross-component dependencies can be tracked and visualised. Back then we focused on database queries. Today we’ll take a closer look at network packets travelling between components.

Maciej Brencz
Legacy Systems Diary

--

At Wikia, our legacy Perl-based scripts use Facebook’s Scribe to process events produced by the MediaWiki application. These scripts were created to defer insert queries and to generate statistics for our users and staff members.

You’re right, Scribe is no longer maintained (the last commit dates back to May 2014). That’s one of the reasons we decided to replace Scribe (which we were using as a message queue) and the Perl scripts with RabbitMQ and tasks written in PHP (so that we can run them in the MediaWiki context).

Scribe-based message queue

Each Scribe message needs to pass through several hops before reaching the backend script (a minimal client-side sketch follows the list):

  • the MediaWiki app sends a Scribe packet (with a category specified)
  • the local Scribe agent forwards the packet to the Scribe master node
  • the Scribe master node uses the packet’s category to route it to the next Scribe client
  • the last Scribe client listens on a specific TCP port and receives packets from the master node
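
For illustration, here is a minimal sketch of that first hop, i.e. how a producer emits a Scribe message. It assumes the Python Thrift bindings generated from scribe.thrift; the host, port, category and message below are placeholders, not our production values:

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from scribe import scribe  # Thrift bindings generated from scribe.thrift

# connect to the local Scribe agent (host and port are placeholders)
socket = TSocket.TSocket(host='127.0.0.1', port=1463)
transport = TTransport.TFramedTransport(socket)
protocol = TBinaryProtocol.TBinaryProtocol(trans=transport, strictRead=False, strictWrite=False)
client = scribe.Client(iprot=protocol, oprot=protocol)

# a single message tagged with the category the master will route on
entry = scribe.LogEntry(category='app_custom_events', message='{"event": "edit"}')

transport.open()
client.Log(messages=[entry])
transport.close()

The local agent then forwards such framed Thrift calls upstream, which is exactly the traffic we are about to sniff.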

Before we can sunset this complex, old system, we need to take a closer look at the routing of packets, the packet categories, and which nodes create and which nodes read Scribe messages. In other words, we need to identify what actually happens under the hood.

ngrep is your friend

The fact that all Scribe packets need to go through the master node is a huge help for us. All we need to do is grab a rather large chunk (thousands of packets) of the network traffic that flows through it. We’ll use the ngrep packet sniffing tool and save the matching packets to a PCAP file:

sudo ngrep -d any -P ' ' -W single '..Log..' port 9090 or 1463 or 5095 or 5088 or 9900 -n25000 -O scribe.pcap > /dev/null

Please refer to the ngrep tutorial for a description of the parameters used above.

We’ll just focus on the pattern used: ‘..Log..’ is a part of every Scribe message. The list of ports was taken from the Chef recipe that configures the Scribe master’s routing rules (1463 is the port the Scribe master receives incoming packets on, and the rest are the ports on which clients receive data routed by the master). The specified number of packets covers roughly 30 seconds of production network traffic.
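
As a side note, this is roughly what a single routing rule looks like in Scribe’s store configuration format (which that Chef recipe manages); the category, host and port below are illustrative, not our actual production rules:

<store>
  category=log_create
  type=network
  remote_host=mq-s3
  remote_port=9900
</store>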

From PCAP file to a graph

OK, so we have a PCAP file with thousands of packets: the IP addresses, ports and payloads are all there.

The data-flow-graph project’s examples include a PCAP processing script. It can analyse Redis, Scribe and generic traffic using the powerful scapy Python library. By parsing packet contents (Redis pushes, Scribe categories, or IP addresses in the generic case) it generates a TSV file describing the data flow.
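
To give a rough idea of what such a script does, here is a simplified sketch (not the actual data-flow-graph code) that reads the capture with scapy and pulls Scribe categories out of packet payloads with a naive regular expression:

# a rough, illustrative sketch (not the actual data-flow-graph code)
import re
from collections import Counter

from scapy.all import IP, Raw, TCP, rdpcap

# Scribe categories are plain ASCII strings embedded in the Thrift payload
# shortly after the "Log" method name; this naive regex is a heuristic,
# not a proper Thrift parser
CATEGORY_RE = re.compile(rb'Log.{0,24}?([a-z][a-z0-9_]{2,63})', re.DOTALL)

flows = Counter()
for pkt in rdpcap('scribe.pcap'):
    if not (pkt.haslayer(IP) and pkt.haslayer(TCP) and pkt.haslayer(Raw)):
        continue
    match = CATEGORY_RE.search(bytes(pkt[Raw].load))
    if match:
        category = match.group(1).decode('ascii')
        flows[(pkt[IP].src, category, pkt[IP].dst)] += 1

# the real script emits weighted source -> category -> destination edges as TSV;
# here we just print the most common flows
for (src, category, dst), count in flows.most_common(10):
    print(f'{src}\t{category}\t{dst}\t{count}')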

Let’s run the actual script:

# processed 25000 packets sniffed in 27.00 sec as scribe
src:ap-s* scribe app_custom_events 1.0000
src:ap-r* scribe app_custom_events 0.7631
src:ap-s* scribe xhprof_data 0.2360
src:ap-s* scribe mwprofiler_data 0.1987
xhprof_data scribe dst:metrics-etl-s1 0.1939
mwprofiler_data scribe dst:metrics-etl-s1 0.1544
...
src:ap-s* scribe log_create 0.0026
log_create scribe dst:indexer-s2 0.0017
src:cron-s1 scribe xhprof_data 0.0017
log_create scribe dst:mq-s3 0.0013
log_create scribe dst:mq-s4 0.0013
log_create scribe dst:job-s1 0.0013
...

The MediaWiki application, running on Apache nodes in both of our data centers (the src:ap-s* and src:ap-r* nodes), produces different kinds (categories) of Scribe messages. Based on the category, they are routed to various consumers on different nodes:

  • metrics-etl-s1 consumes profiler data and pushes performance metrics to InfluxDB
  • mq-s3 / mq-s4 generate raw event logs that we push to S3 buckets for later processing
  • job-s1 is the node where the legacy Perl scripts are running
  • indexer-s2 pushes documents to Solr server that powers search on our wikis

Scribe message flow generated using the above-mentioned tools: blue indicates nodes where events are generated, green marks Scribe categories, and orange marks client nodes where events are consumed

The generated TSV file can be saved as a GitHub Gist and passed to the data-flow-graph visualisation tool.

By running just two commands (ngrep and a helper Python script) we made our mysterious legacy system slightly less mysterious. We can easily re-run this toolchain to keep track of our progress and review our steps as we sunset Yet Another Legacy System.

Visualise everything

The approach described here can be used in various cases:

  • analysing HTTP traffic reaching Apache Solr
  • visualising how requests flow between proxy / load balancer and backend nodes
  • requests travelling to specific third-party servers
  • RabbitMQ pushes and pops
  • and many more…
