Real-time Anomaly Detection in VPC Flow Logs, Part 2: Proposed Architecture

Igor Kantor
4 min read · Feb 12, 2018


Photo by Steve Driscoll on Unsplash

Before we dive into the technical implementation details, it helps to take a step back to consider the problem at a higher level.

Namely, is there a broader context into which this whole anomaly detection business fits? For example, let’s say we invest hours into this and it actually works, so what? Why should anyone care?

All good questions!

In my opinion, the broader picture here is the Common Logging, Error Handling, and Anomaly Detection Operational Intelligence Platform, better known by its popular name of CLEHADOIP…

…OK, that is not a thing.

However, a modern software engineering organization absolutely must concern itself with democratizing access to its operational data!

This is especially true if you are dealing with a modern tech stack, one that can easily span multiple containers, Lambda functions, web servers, load-balancers, database servers, application servers, microservice orchestration engines, etc.

Further, as anybody who has actually tried doing this in real life can attest, ingesting, aggregating, and analyzing these voluminous, disparate data sets is no easy feat. And turning that flood of data into real, actionable operational intelligence is more difficult still!

In fact, high-performing companies like Netflix have entire teams dedicated to this effort. For instance, look at this job description that recently popped up in my LinkedIn feed:

Netflix Operational Insight Team is the team responsible for building common infrastructure to collect, transport, aggregate, process and visualize operational metrics. We build powerful systems to allow everyone at Netflix visibility into the state of our environment at both a macro and micro level.

To me, this is incredibly fascinating. One of the most progressive and forward-looking companies on the planet decided to create a whole team dedicated to giving everyone “visibility into the state of our environment.”

Why?

I think the answer is subtle but, in the end, pretty obvious. They are doing this to make everyone’s lives easier!

And I truly mean, everyone. Because once you have your operational data unified, with free access to all who need it, your SREs are no longer stuck with a tail -f /var/log/messages off an NFS share or an rsyslogd server.

And your software engineers have immediate visibility into the code they have written and can spot behavioral anomalies across a wide swath of servers or microservices. Or… who knows… they might turn to machine learning to do this. :)

And your DevOps teams can easily build CI/CD pipelines that perform canary-based deployments, watching the real-time metrics coming back from the deployed software to adjust the traffic percentages accordingly.
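To make that last point a little more concrete, here is a purely illustrative sketch (not something we build in this series) of what the core canary loop could look like: shift a slice of traffic to the new version via weighted Route 53 records, watch a CloudWatch alarm on the canary’s error rate, and roll back if the metrics look bad. Every zone ID, record name, and alarm name below is a placeholder.

```python
# Hypothetical canary loop: ramp traffic to the canary in steps and roll
# back if its error-rate alarm fires. All names/IDs are placeholders.
import time
import boto3

route53 = boto3.client("route53")
cloudwatch = boto3.client("cloudwatch")

HOSTED_ZONE_ID = "Z123EXAMPLE"          # placeholder
RECORD_NAME = "api.example.com."        # placeholder
STABLE_DNS = "stable.example.com"       # placeholder
CANARY_DNS = "canary.example.com"       # placeholder

def set_weights(canary_weight):
    """Send canary_weight% of traffic to the canary record."""
    def record(identifier, value, weight):
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": RECORD_NAME,
                "Type": "CNAME",
                "SetIdentifier": identifier,
                "Weight": weight,
                "TTL": 60,
                "ResourceRecords": [{"Value": value}],
            },
        }
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [
            record("stable", STABLE_DNS, 100 - canary_weight),
            record("canary", CANARY_DNS, canary_weight),
        ]},
    )

def canary_is_healthy():
    """Healthy as long as the (hypothetical) canary error-rate alarm is OK."""
    alarms = cloudwatch.describe_alarms(AlarmNames=["canary-error-rate"])
    return all(a["StateValue"] == "OK" for a in alarms["MetricAlarms"])

# Ramp up in steps, watching the real-time metrics between each step.
for weight in (5, 25, 50, 100):
    set_weights(weight)
    time.sleep(300)  # bake period before checking the metrics
    if not canary_is_healthy():
        set_weights(0)  # metrics look bad: roll everything back
        break
```

In practice you would hand this off to a deployment tool, but the loop itself (shift, observe the metrics, decide) stays the same, and it only works if those metrics are flowing into one place.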

Finally, your CIO and CEO will now have real-time visibility into the overall health of the business, whether from a technical standpoint, through business metrics, or a mix of both.

In short, centralizing your logs and metrics is an incredibly powerful construct, one that can easily turn into a true competitive advantage for your organization!

To me, this is the bigger picture here and it might look something like this:

In fact, if you are familiar with Lambda Architecture (the non-Amazon kind), you will recognize this as something very similar — the platform proposed above handles large quantities of data using both stream- and batch-processing approaches.

Now, let’s break this down so we all know what we are proposing to build here.

  1. The elements at the bottom of the diagram (everything below the Network Load Balancer and the CloudWatch log box) are your data sources. These will range from infrequently updated sources, like system event logs, to those pumping out massive quantities of data, like Packetbeat.
  2. The components within the red dashes comprise the anomaly detection pipeline we are going to build as part of this series. The pipeline will receive the streams from multiple sources (a rough sketch of that wiring follows this list). The dashed line from Logstash to Kinesis is a future possibility; I don’t yet know if it makes sense.
  3. The boxes in green are your longer-term archival. Notice that every message goes to three different places: a) the real-time streaming pipeline for immediate anomaly detection (minutes); b) Elasticsearch + Kibana for short-term (days) trend analysis; and c) S3 object storage for long-term archival and analytics (months).
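To make item 2 a bit more concrete, here is a minimal, hypothetical sketch of how the pipeline could receive the VPC Flow Logs stream: a CloudWatch Logs subscription filter that forwards every flow-log event into a Kinesis stream. The log group, stream, and role names are placeholders, and the real wiring is what we build out in Part 3.

```python
# Sketch: subscribe a CloudWatch Logs group (VPC Flow Logs) to a Kinesis
# stream so the anomaly detection pipeline can consume it in real time.
import boto3

logs = boto3.client("logs")
kinesis = boto3.client("kinesis")

STREAM_NAME = "vpc-flow-log-stream"                          # placeholder
LOG_GROUP = "vpc-flow-logs"                                  # placeholder
ROLE_ARN = "arn:aws:iam::123456789012:role/cwl-to-kinesis"   # placeholder

# Create the Kinesis stream the pipeline will read from.
kinesis.create_stream(StreamName=STREAM_NAME, ShardCount=1)
kinesis.get_waiter("stream_exists").wait(StreamName=STREAM_NAME)
stream_arn = kinesis.describe_stream(
    StreamName=STREAM_NAME)["StreamDescription"]["StreamARN"]

# Subscribe the flow-log group to the stream; an empty filter pattern
# forwards every event.
logs.put_subscription_filter(
    logGroupName=LOG_GROUP,
    filterName="flow-logs-to-kinesis",
    filterPattern="",
    destinationArn=stream_arn,
    roleArn=ROLE_ARN,
)
```

The same fan-out idea applies to the other two destinations: one copy of each message feeds the streaming path, another lands in Elasticsearch, and a third is archived in S3.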

To summarize, the whole puzzle, and the holy grail we are after, is an open, feature-rich operational intelligence platform that reveals new business and technical insights for the entire organization!

OK, enough rambling. Go on to Part 3 where we actually build this thing!

Shameless self-promotional plug: if you are curious how to build the other piece of this puzzle (the blue dashes), which is a fully containerized ELK (Elasticsearch, Logstash, Kibana) stack running in Amazon Web Services, please see my other series.
