Cross-components data flow

Moving from a monolithic architecture towards micro-services has well-known benefits, but they come at a price. Dependencies are now much harder to identify, and telling how data flows between components becomes a non-trivial task. But there’s a solution…

Published in Legacy Systems Diary, Dec 5, 2017

Let’s say that you have a task to sunset a feature or replace a legacy piece of code, or you’re debugging an issue in a fairly new feature. Looking at the code alone is often not enough, as dependencies go far beyond function calls or HTTP requests to micro-services. Code is just one part of the equation: it is there to process the data, and understanding the data flow is key here.

Of course, you can grep the code and identify all the SELECT and INSERT / UPDATE queries, message queue pushes and all the rest in various repositories and prepare a graph of what writes and reads the data. But there’s a problem — this graph is out of date the moment you complete it.

data-flow-graph to the rescue

macbre/data-flow-graph (github.com): uses your app logs to visualize how the data moves between the code, database, HTTP services, message…

Fortunately, the process of generating the dependency graph can be automated. You do log your SQL queries, task queue pushes or S3 storage uploads, right? Why not use this machine-generated data and automate the process of putting a graph together? And as a bonus you’d get nice, visual documentation that’s always up to date. Dreams coming true? :)

Let’s first define what such a graph will present:

source node: script or code class name, database table, S3 bucket, message queue name, HTTP end-point, ...
edge name: DB query, method name; basically, an action that took place and moved the data from the source node to the target node
target node: where the data ends up, e.g. a database table that receives INSERTs from an offline script (the source node)
edge weight (optional): the edge thickness can indicate the percentage of traffic, queries per second, …
edge metadata (optional): will be displayed as edge’s tooltip

Let’s describe the graph

We will use a simple TSV format that can be easily filtered and concatenated:

(source node)[tab](edge name)[tab](target node)[tab](an optional edge weight)[tab](an optional metadata)

An example:

backend:events_local_users.pl   events_local_users.pl:651 (INSERT)  specials:events_local_users 0.98    job, median time: 1107.85 ms, count: 93800

The line above represents a backend script called events_local_users.pl (source node) that performs an INSERT query (edge) on the events_local_users table in the specials database (target node).
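To make the field layout concrete, here is a minimal Python sketch that splits one such line into its five parts. The `Edge` type and the `parse_edge` helper are names made up for this illustration and are not part of data-flow-graph; the sketch assumes the optional fields may simply be missing from the end of the line.

```python
from typing import NamedTuple, Optional


class Edge(NamedTuple):
    """One line of the TSV graph definition (hypothetical helper type)."""
    source: str
    name: str
    target: str
    weight: Optional[float] = None
    metadata: Optional[str] = None


def parse_edge(line: str) -> Edge:
    """Parse one tab-separated line; weight and metadata are optional."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 tab-separated fields, got {len(fields)}")
    source, name, target = fields[:3]
    weight = float(fields[3]) if len(fields) > 3 and fields[3] else None
    metadata = fields[4] if len(fields) > 4 and fields[4] else None
    return Edge(source, name, target, weight, metadata)


# The example line from the article, with the tabs written out explicitly.
line = (
    "backend:events_local_users.pl\t"
    "events_local_users.pl:651 (INSERT)\t"
    "specials:events_local_users\t"
    "0.98\t"
    "job, median time: 1107.85 ms, count: 93800"
)
edge = parse_edge(line)
print(edge.source, "->", edge.target, edge.weight)
```

A line with only the three mandatory fields parses too; its weight and metadata are then `None`.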

The TSV file can easily be generated using a simple script that processes SQL logs stored in ELK. Note: you can prefix source and target node names (like backend: and specials: in the example above) to group them and render each group in a different colour on the graph.
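A rough Python sketch of what such a generator script could look like is shown below. It is not the script used by the author: the shape of a log entry (a dict with `caller`, `db` and `query` keys), the `backend:` prefix for every caller, the `UserStats.php` name and the choice to draw SELECTs as edges from the table to the code are all assumptions made for this example. The regular expression only covers simple single-table statements.

```python
import re
from collections import Counter

# Finds the table name in simple SELECT / INSERT / UPDATE / DELETE statements.
QUERY_RE = re.compile(
    r"^\s*(?:(SELECT)\b.*?\bFROM|(INSERT)\s+INTO|(UPDATE)|(DELETE)\s+FROM)\s+`?(\w+)`?",
    re.IGNORECASE | re.DOTALL,
)


def log_to_edge(entry):
    """Turn one log entry into a (source, edge name, target) triple, or None."""
    match = QUERY_RE.match(entry["query"])
    if match is None:
        return None
    kind = next(g for g in match.groups()[:4] if g).upper()
    table = f"{entry['db']}:{match.group(5)}"
    code = f"backend:{entry['caller']}"
    # Assumption: reads move data from the table to the code, writes the other way.
    if kind == "SELECT":
        return (table, kind, code)
    return (code, kind, table)


def edges_to_tsv(entries):
    """Aggregate log entries into TSV lines, weighted by their share of all queries."""
    counts = Counter(e for e in map(log_to_edge, entries) if e is not None)
    total = sum(counts.values())
    lines = []
    for (source, name, target), count in sorted(counts.items()):
        lines.append(f"{source}\t{name}\t{target}\t{count / total:.2f}\tcount: {count}")
    return "\n".join(lines)


# Hypothetical log entries, e.g. as fetched from a log store.
insert = {
    "caller": "events_local_users.pl",
    "db": "specials",
    "query": "INSERT INTO events_local_users (user_id) VALUES (1)",
}
select = {
    "caller": "UserStats.php",
    "db": "specials",
    "query": "SELECT user_id FROM events_local_users",
}
tsv = edges_to_tsv([insert, insert, insert, select])
print(tsv)
```

Counting identical edges before writing them keeps the file small and gives the edge weight for free; queries the pattern does not recognise are skipped rather than guessed at.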

Let’s render the graph

data-flow-graph is a visualisation tool built on top of the d3.js visualisation library. It uses the simple, text-based format described above.

You can take the example static HTML file from the repository and paste a TSV graph definition into it; the graph will be rendered for you.

Alternatively, to make it easier to share the graph, you can upload the TSV definition to Gist (either a public or a private one) and use the data-flow-graph Gist viewer. Just paste the Gist URL in the prompt and voilà. A shareable link will be generated for you.

An example of data flow between the code, MySQL database tables, redis and s3 storage, visualised using data-flow-graph.

What can be visualised on the graph?

The tool described above can be used in far more areas than just analysing data flows that happen on a database level. The idea can be extended to handle:

message queues (Redis, RabbitMQ, Scribe, …)
HTTP services communication (GET, POST requests)
S3 storage operations
tcpdump / varnishlog traffic between the hosts
use your imagination!
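Because the format is line-based, edges coming from different kinds of logs can simply be concatenated into one graph definition and filtered afterwards. The Python sketch below illustrates this; the node names, the `queue:`, `worker:` and `s3:` prefixes and both helper functions are hypothetical and only serve this example.

```python
def merge_tsv(*fragments):
    """Concatenate TSV fragments, dropping blank lines and exact duplicates."""
    seen = set()
    merged = []
    for fragment in fragments:
        for line in fragment.splitlines():
            if line.strip() and line not in seen:
                seen.add(line)
                merged.append(line)
    return merged


def touching(lines, prefix):
    """Keep only the edges whose source or target node starts with the prefix."""
    kept = []
    for line in lines:
        source, _name, target = line.split("\t")[:3]
        if source.startswith(prefix) or target.startswith(prefix):
            kept.append(line)
    return kept


# Hypothetical edges produced from three different kinds of logs.
sql_edges = "backend:events_local_users.pl\tINSERT\tspecials:events_local_users\n"
queue_edges = (
    "backend:UploadController\tpush\tqueue:thumbnails\n"
    "queue:thumbnails\tpop\tworker:thumbnailer\n"
)
s3_edges = "worker:thumbnailer\tupload\ts3:thumbnails\n"

graph = merge_tsv(sql_edges, queue_edges, s3_edges)
queue_only = touching(graph, "queue:")
print("\n".join(queue_only))
```

Filtering by prefix is handy when the full graph gets too dense and you only care about one database, queue or bucket.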

These graphs can also be used as part of a feature’s documentation. And because nobody likes outdated docs, they can be updated automatically by just re-running the TSV-generator script periodically.

The data-flow-graph tool was first described on Wikia’s Engineering Blog.