StreamAlert: Real-time Data Analysis and Alerting

Published in The Airbnb Tech Blog, Jan 31, 2017

Today we are incredibly excited to announce the open source release of StreamAlert, a real-time data analysis framework with point-in-time alerting. StreamAlert is unique in that it is serverless, scales to terabytes per hour, automates infrastructure deployment, and is secure by default.

In this blog post, we’ll cover why we built it, additional benefits, supported use-cases, how it works and more!

Why StreamAlert?

Airbnb needed a product that empowered both engineers and administrators to ingest, analyze, and alert on data in real-time from their respective environments.

As we reasoned about our use cases and explored available options, we codified our requirements:

Deployment is simple, safe and repeatable for any AWS account
Easily scaled from megabytes to terabytes per day
Infrastructure maintenance is minimal, no devops expertise required
Infrastructure security is a default, no security expertise required
Support data from different environments (ex: IT, PCI, Engineering)
Support data from different environment types (ex: Cloud, Datacenter, Office)
Support different types of data (ex: JSON, CSV, Key-Value, Syslog)
Support different use-cases like security, infrastructure, compliance and more

We were unable to find a product that fit these requirements, so we built our own. Since one of our requirements was that the product be environment agnostic, it naturally lent itself to being an open source project.

Benefits

As partially outlined above, StreamAlert has some unique benefits:

Serverless — StreamAlert utilizes AWS Lambda, which means you don’t have to manage, patch or harden any new servers
Scalable — StreamAlert utilizes AWS Kinesis Streams, which will “scale from megabytes to terabytes per hour and from thousands to millions of PUT records per second”
Automated — StreamAlert utilizes Terraform, which means infrastructure and supporting services are represented as code and deployed via automation
Secure — StreamAlert uses secure transport (TLS), performs data analysis in a container/sandbox, segments data per your defined environments, and uses role-based access control (RBAC)
Open Source — Anyone can use or contribute to StreamAlert

Use-cases

The graphic below denotes some example data sets that StreamAlert can analyze:

StreamAlert aims to be as agnostic as possible in order to support the widest range of data analysis and alerting use-cases.

At a high level, StreamAlert supports:

Any Source — StreamAlert can accept data from an S3 bucket or any agent/service that supports sending to Amazon Kinesis Streams. Examples: fluentd, logstash, aws-kinesis-agent, osquery, Java, JavaScript, Ruby, PHP, or any language supported by the AWS SDK
Any Operating System — StreamAlert can accept data from any device that supports log forwarding (Linux, MacOS, Windows, …)
Any Environment — StreamAlert can accept data from any environment that has internet connectivity (Cloud, Datacenter, Office, Hybrid)
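
As a rough sketch of the "Any Source" path, an application could forward a JSON log line to a Kinesis stream with the AWS SDK for Python (boto3). The `send_log` helper, the stream name, and the record fields are hypothetical and chosen only for illustration; `put_record` with `StreamName`, `Data` and `PartitionKey` is the standard boto3 Kinesis call. The client is passed in as an argument so the function can be exercised without AWS access.

```python
import json


def send_log(kinesis_client, stream_name, record):
    """Forward one log record (a dict) to a Kinesis stream as JSON.

    `kinesis_client` is expected to be a boto3 Kinesis client, e.g.
    boto3.client("kinesis"); it is injected so the function is easy to test.
    """
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    # Records that share a partition key are routed to the same shard.
    # Using the host name as the key is an assumption made for this sketch.
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=payload,
        PartitionKey=str(record.get("host", "unknown")),
    )


# Usage (requires boto3 and AWS credentials; all names are placeholders):
#   import boto3
#   client = boto3.client("kinesis", region_name="us-east-1")
#   send_log(client, "example_stream", {"host": "web-01", "message": "hello"})
```

In practice an agent such as fluentd or aws-kinesis-agent would usually handle this step, so hand-written forwarding like the above is only needed for custom sources.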

From a data perspective, StreamAlert supports formats such as JSON, CSV, Key-Value, and Syslog.

If you’re an AWS customer, gzipped versions of these data formats are supported in S3 buckets. As a result, StreamAlert supports CloudTrail, AWS Config and S3 Server Access Logs out of the box.
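
To make the format support more concrete, here is a minimal sketch of how gzipped JSON lines, CSV rows and key-value lines could be parsed with the Python standard library. It is not StreamAlert's own parser; the function names and the sample field names are hypothetical.

```python
import csv
import gzip
import io
import json


def read_gzipped(blob):
    """Decompress a gzipped object body (bytes) into text."""
    return gzip.decompress(blob).decode("utf-8")


def parse_json_lines(text):
    """Parse newline-delimited JSON into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def parse_csv(text, fieldnames):
    """Parse headerless CSV rows into dicts keyed by `fieldnames`."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=fieldnames)
    return [dict(row) for row in reader]


def parse_key_value(line, pair_sep=" ", kv_sep="="):
    """Parse a 'k1=v1 k2=v2' style line into a dict."""
    pairs = (item.split(kv_sep, 1) for item in line.split(pair_sep) if kv_sep in item)
    return {key: value for key, value in pairs}
```

All parsed values stay as strings for CSV and key-value input, so a rule that compares numbers would need to convert them first.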

If you’re not an AWS customer, StreamAlert can support data such as:

Host Logs (e.g. Syslog, osquery, auditd)
Network Logs (e.g. Palo Alto Networks, Cisco)
Web Application Logs (e.g. Apache, nginx)
SaaS providers (e.g. Box, OneLogin)

It should be noted that StreamAlert is not intended for analytics, metrics or time series use-cases. There are many great open source and commercial offerings in this space, including but not limited to Prometheus, DataDog and NewRelic.

Getting Under The Hood

Data Analysis

Rules encompass data analysis and alerting logic and are written in Python. Here’s an example that alerts on the use of sudo in a PCI environment:

StreamAlert Python Example
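
As a rough illustration of the idea, a rule of this kind might look like the following self-contained sketch. The `rule` decorator, its parameters, the `RULES` registry, the `evaluate` helper and the record fields are all hypothetical stand-ins defined in the block itself; they are not StreamAlert's actual API.

```python
RULES = []  # hypothetical in-memory rule registry


def rule(name, logs, environments, outputs):
    """Hypothetical decorator that registers a rule function with its metadata."""
    def decorator(func):
        RULES.append({
            "name": name,
            "logs": set(logs),
            "environments": set(environments),
            "outputs": list(outputs),
            "func": func,
        })
        return func
    return decorator


@rule("sudo_in_pci", logs=["syslog"], environments=["pci"], outputs=["pagerduty", "slack"])
def sudo_in_pci(record):
    """Fire when a syslog record was emitted by the sudo program."""
    return record.get("program") == "sudo"


def evaluate(record, log_type, environment):
    """Return (rule name, outputs) for every registered rule that fires."""
    alerts = []
    for entry in RULES:
        if log_type in entry["logs"] and environment in entry["environments"]:
            if entry["func"](record):
                alerts.append((entry["name"], entry["outputs"]))
    return alerts
```

Because a rule here is a plain Python function, it can be tested by calling `evaluate` with sample records, for example `evaluate({"program": "sudo"}, "syslog", "pci")`.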

You have the flexibility to perform simple or complex data analysis. Some of the notable features of this approach are:

Rules are written in native Python, not a proprietary language
A Rule can utilize any Python function or library
A Rule can be run against multiple log sources if desired
Rules can be isolated into defined environments/clusters
Rule alerts can be sent to one or more outputs, like S3, PagerDuty or Slack
Rules can be integration tested

Alerting

As outlined above, StreamAlert comes with a flexible alerting framework that can integrate with new or existing case/incident management tools. StreamAlert enables your Rules to send alerts to one or many outputs.

Out of the box, StreamAlert supports S3, PagerDuty and Slack. It can also be extended to support any API. Outputs will be more modular in the near future to better support additional outputs and public contributions.

Adhering to the secure by default principle, all API credentials are encrypted and decrypted using AWS Key Management Service (KMS).
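
A minimal sketch of the decryption side, assuming the credentials for an output are stored as a KMS-encrypted JSON blob. The `load_output_credentials` helper and the JSON layout are assumptions made for illustration; `decrypt(CiphertextBlob=...)` returning the decrypted bytes under `Plaintext` is the standard boto3 KMS call.

```python
import json


def load_output_credentials(kms_client, ciphertext_blob):
    """Decrypt a KMS-encrypted JSON credential blob and return it as a dict.

    `kms_client` is expected to be a boto3 KMS client, e.g. boto3.client("kms");
    it is injected so the function can be tested without AWS access.
    """
    response = kms_client.decrypt(CiphertextBlob=ciphertext_blob)
    # boto3 returns the decrypted bytes under the "Plaintext" key.
    return json.loads(response["Plaintext"].decode("utf-8"))


# Usage (requires boto3, AWS credentials and a real ciphertext):
#   import boto3
#   creds = load_output_credentials(boto3.client("kms"), encrypted_bytes)
```

Keeping the plaintext only in memory, and granting `kms:Decrypt` solely to the role that sends alerts, is one way to stay in line with the secure by default principle.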

Architecture

StreamAlert utilizes the following services:

AWS Kinesis Streams — Data stream; AWS Lambda polls this stream (stream-based model)
AWS Kinesis Firehose — Loads streaming data into S3 long-term data storage
AWS Lambda (Python) — Data analysis and alerting
AWS SNS — Alert queue
AWS S3 — Optional datasources, long-term data storage, & long-term alert storage
AWS CloudWatch — Infrastructure metrics
AWS KMS — Encryption and decryption of application secrets
AWS IAM — Role-based Access Control (RBAC)
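
To illustrate how the Kinesis and Lambda pieces fit together, here is a sketch of a Lambda handler that decodes the records it receives from a stream. Kinesis delivers each record to Lambda base64-encoded under `Records[].kinesis.data`; the JSON payload shape, the `decode_kinesis_event` helper and the placeholder analysis step are assumptions, not StreamAlert's actual handler.

```python
import base64
import json


def decode_kinesis_event(event):
    """Yield the decoded JSON payload of each Kinesis record in a Lambda event."""
    for record in event.get("Records", []):
        raw = base64.b64decode(record["kinesis"]["data"])
        yield json.loads(raw.decode("utf-8"))


def handler(event, context):
    """Hypothetical Lambda entry point: decode records and count matches."""
    payloads = list(decode_kinesis_event(event))
    # Placeholder analysis step (an assumption): count records whose
    # "program" field is "sudo". Real rules would be evaluated here.
    matches = [p for p in payloads if p.get("program") == "sudo"]
    return {"records": len(payloads), "alerts": len(matches)}
```

In a full pipeline the matches would be published to the alert queue rather than just counted.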

If this looks overwhelming, don’t worry: infrastructure deployment is automated via Terraform, and because the stack is serverless, you don’t have to manage, patch or harden any new servers!

The Future

The idea of crowdsourcing your alerts isn’t new. Slack does this and the blog speaks at length to the benefits. In the near future, StreamAlert will support this use-case, allowing you to decentralize your triage efforts, getting alerts to those with the most context. We’re aiming for Q1/Q2'17.

In the near future, StreamAlert will support comparing logs against traditional indicators of compromise (IOCs), which can range from thousands to millions in volume. This will be built in a way that’s provider agnostic, allowing you to use ThreatStream, ThreatExchange, or whatever your heart desires. We’re aiming for Q1/Q2'17.

StreamAlert will also support receiving data via an HTTP endpoint. This is for service providers or appliances that only support HTTP endpoints for logging. We’re aiming for Q2'17.

For historical searching, StreamAlert will use AWS Athena, a serverless, interactive query service that uses Presto to query data in S3. This will allow you to analyze data using SQL for both ad-hoc and scheduled queries. We’re aiming for Q3/Q4'17.

Concluding Thoughts

Open source has allowed us, as a community, to share, collaborate, and iterate on common needs and goals. Now, with the ability to represent infrastructure as code, this goal can be further realized with reduced costs for both development and deployment.

We hope StreamAlert serves as an example of this, making deployment simple, repeatable and safe so that anyone can use it easily.

Credits and contributions:

@jack_naglieri (core architect and engineer)
@mimeframe (concept, website, docs, code & content review)
@strcrzy (rules decorator code)
@zwass (osquery kinesis plugins)
@hackgnar (osquery kinesis plugin bug fixes)
@austinbbyers (code review)
@emma_c_lin (code review)
@awscloud team (AWS services and support)
@hashicorp team (Terraform)

Finally, if you want to get started using or contributing to StreamAlert, please visit https://github.com/airbnb/streamalert

Thanks!

Security Team @ Airbnb