Event Data at Scale

Creating data infrastructure for modern organisations

Paul C
Beamery Hacking Talent
13 min read · Oct 18, 2020



As the data requirements placed on companies increase year on year, there has to be a solid infrastructure and process in place to ensure that data is not only safe and secure, but also captured in a meaningful way that allows it to become useful to the business in a variety of ways.

While this applies to nearly all modern companies, SaaS companies such as Beamery have to be very astute in the way that data is captured, stored, and used. Event data is the lifeblood of any company that wants to be around for the foreseeable future, as it allows informed decisions to be made about the business and about what customers are doing. In this article we will discuss the key issues that need to be tackled to make sure that event data is captured and used in the correct way.

At the end of this article we will delve into the specifics of the technologies that Beamery settled on, and why, but first we need to cover some important topics.

So, what are the main things that need to be considered when storing event data at scale? There are three main areas to consider, and they will guide the correct selection of technologies.

  • Security
  • Scalability
  • Usability

Event Data

Before we dive headlong into each of these topics it’s worth taking a step back and defining the kind of data that we are trying to capture. In simple terms event data can be broken down and defined as a piece of information about something very specific that happened at an exact point in time.

Examples of events include a user of a web application clicking on a button, an email being sent, or some data being altered. Each of these events is very different in and of itself, but when broken down they are all essentially the same thing: time-series data that needs to be captured, analysed, and made sense of.
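
As a rough illustration, a minimal event shape could look something like the TypeScript sketch below. The field names are hypothetical, not a real schema from this pipeline:

```typescript
// A minimal, generic event shape: something specific that happened
// at an exact point in time. Field names are illustrative only.
interface RawEvent {
  eventId: string;        // unique identifier for this occurrence
  type: string;           // e.g. "button.clicked", "email.sent", "record.updated"
  occurredAt: string;     // ISO-8601 timestamp of when the event happened
  payload: Record<string, unknown>; // event-specific details
}

const example: RawEvent = {
  eventId: "3f6c1e9a-7b2d-4c1f-9a21-0d5e8b4c7f10",
  type: "button.clicked",
  occurredAt: new Date().toISOString(),
  payload: { buttonId: "save", page: "/contacts" },
};
```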

There are many different ways to capture events, and we will discuss those in following articles, but for now we will look at a more generalised view of what is important from a data perspective.

Security

Security in general is a big topic, especially data security, with rules such as the General Data Protection Regulation (GDPR) having come into force on 25th May 2018. GDPR puts a heavy emphasis on data security, and not following through on a robust data security policy can result in large fines that some businesses might not be able to recover from.

When we talk about data security as a business we need to compartmentalise things. This is encapsulated by international standards such as ISO 27001, which asks a business to define who is responsible for which data within the organisation.

In general the main considerations are about securing access to data in multiple ways. This encompasses ensuring that only the correct systems and people have physical access to the hardware or cloud systems storing the data, making sure that data is encrypted when it is not in transit (encryption at rest), and making sure that data is only ever transmitted over secure links (TLS, SSH, etc.).

These are but a few considerations and there is not enough room within this post to go into fine detail; suffice it to say that data security is something that should be taken very seriously and researched fully. You have been warned!

Scalability

When we talk about scale in the technology industry it is normally about making sure an API or some other user-facing component can meet the demands made of it in a consistent and predictable way. Scaling data systems is just as important, and it is not always immediately obvious what should be scaled and how.

We all know that to be able to service requests from a database we can scale that database horizontally by adding more nodes, increasing shards, and so on. But what of the systems that connect to large data stores? These need to scale just as much as the primary databases; without this, processing data for business intelligence reports becomes time consuming and lags behind events as they arrive. This means that the intelligence data on which business decisions are based will not be current and, in the worst case, no longer relevant. Being able to react to data changes quickly is extremely important.

Scaling considerations for connected systems will normally be focused on processing systems: things such as message brokers, cold storage systems, data lakes, and so on. With these systems scaled to cope with an increased event data rate, it becomes possible to process data in near real-time with the correct technology choices.

Usability

The concept of usability is generally referenced in relation to user-facing interfaces in software applications or web browsers, but in this instance we are talking about how usable a piece of event data is. We must be careful in defining a schema for event data. A schema must not only be future-proof, but also contain fixed data points that make each event useful to derive information from when it is processed along with the other events that have been collected.

All event data is time-series data; that is, we have a number of data points associated with a specific point in time. There are a number of ways we can define a useful schema, but to do so we must know what we are trying to produce at the end of event ingestion.

Things to consider are:

  • Are the proper identifiers attached to each event?
  • Do events need transforming or enriching to be more meaningful (location data, user data, etc.)?
  • What format should transformed event data be in for ease of processing (JSON, Avro, something else…)?
  • Can engineers easily interact with event ingestion systems?
  • Do engineers have the skillset now or will training be needed?

While these questions are not exhaustive, they provide a guide that will result in a good schema and a graceful system design that can be used to generate useful intelligence data later on. This is of paramount importance.
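
To make those considerations concrete, here is a hedged sketch of what a transformed and enriched event could look like. The identifiers and enrichment fields are illustrative assumptions rather than a prescribed schema:

```typescript
// A possible "common format" for processed events: stable identifiers,
// precise timestamps, and optional enrichment added during processing.
// All field names are illustrative assumptions.
interface EnrichedEvent {
  eventId: string;          // identifier attached to every event
  accountId: string;        // which customer/tenant the event belongs to
  userId?: string;          // who triggered it, when known
  type: string;             // e.g. "record.updated"
  occurredAt: string;       // when the event actually happened (ISO-8601)
  processedAt: string;      // when the pipeline transformed it
  location?: { country: string; city?: string }; // example enrichment
  payload: Record<string, unknown>;              // original event details
}
```

Keeping the enrichment fields optional means the same schema remains valid whether or not the extra context is available at processing time.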

Technology Selection

There are many different technologies that live within the data processing arena, from older, well-known systems based on Hadoop clusters to newer systems like Flink, ksqlDB, and others. The right technology to use when processing your event data can depend on many different things, though probably the most important question you need to ask is: "What do you want to achieve?"

Here are a few questions that will likely need to be considered before any technology selection takes place.

  • How much data do we need to process now?
  • How much data will we need to process in one or two years?
  • Will we need to scale consistently over time or account for sudden jumps in event data volume?
  • Do we need a system that can operate at near real-time or are we happier working with a system that does batch processing at specific times?
  • Can the existing engineers in the organisation access the technology easily or are new skills required?
  • Do we need to hire or train staff?

Assuming that all questions have been answered in a satisfactory way, it's time to start actually selecting the technologies that are going to be used in your data processing pipeline. Generally there are five or six distinct areas to an event processing system. These are:

  • Producers — as the name gives away, these deal with event production. Event producers are probably the most varied group of components as they can encompass many different things. For example: a database generating a change stream, users clicking on web page buttons, an email being sent, etc. The point is that a producer in some form will generate events.
  • Collectors — Closely related to producers, a collector will take generated events and move them into some form of transient data store. In effect they act as a proxy to provide a common way to move events into the data pipeline. Producers and collectors are quite often embedded in the same component. How these are split up — or not — will depend very much on the events being captured.
  • Transient Storage — more often than not, the transient event storage will be some kind of message broker. This component allows data events to move as messages in a predictable way from the producers/collectors for processing. There are many different message brokers, or queue systems, on the market with each having their own way of handling data. The selection here will depend very much on the requirements of the pipeline being designed.
  • Consumers — consumers have the task of retrieving each new message — event — from the transient store and processing the data in some way. How the data is transformed will be dependent on what the event pipeline is designed to do. However, generally we find that events will be transformed into a common format for final storage. Transformation can also include enrichment which is where additional information can be added to the event to give it greater context. There can be many different types of consumers from small custom workers, to larger and more complex stream processing platforms.
  • Data Lake (optional) — before data is processed into the final schema to be stored and queried it might be desirable to maintain a data lake. That is, a storage area of the raw, unprocessed, events. This is useful to have as over time business needs change and different types of transforms might need to be applied to the raw data to gain different insights. A data lake could be a database or files in cold storage (cloud provider buckets and objects).
  • Data Warehouse — the data warehouse is the final resting place for processed — and also if desired raw — events. That is not to say each event ends its journey here, but in general terms this is where events can be queried to produce useful business insights. As with other components in the pipeline there are many forms of storage, but generally we could say this is some kind of database.

The Beamery Event Data Pipeline

So, after all of this discussion what did Beamery actually settle on for an events pipeline? The first thing we did was to answer all of our questions over data volumes, ease of use, schema design etc. This narrowed the field down greatly to allow us to go more in-depth on a smaller set of technologies. For completeness here is a quick recap on some of the questions posed and our answers to them.

  • Q: How much data do we need to process now?
  • A: We know exactly where we need to start off with event data volumes based on our existing data metrics.
  • Q: How much data will we need to process in one or two years?
  • A: We know we have a fairly linear curve from our historic data, but we also know that when a new bigger customer joins we see sudden surges of event data. Based on this we need to make sure we allocate 50% overhead capacity at all times.
  • Q: Will we need to scale consistently over time or account for sudden jumps in event data volume?
  • A: We need to account for both scenarios. Which means we need components that are easy to scale out on demand.
  • Q: Do we need a system that can operate at near real-time or are we happier working with a system that does batch processing at specific times?
  • A: We wish to be able to gain insights into event data quickly, and to also base some user facing reports on this. Given our customers expect fast feedback we need systems to operate at near real-time.
  • Q: Can the existing engineers in the organisation access the technology easily or are new skills required?
  • A: Acquisition of some new skills, such as the ability to provision, operate, and monitor new components, is acceptable for this project. We should, however, still make use of the primary core language in use at Beamery: engineers here mostly work with JavaScript/TypeScript (if you are interested in a challenge, see our careers page).
  • Q: Do we need to hire or train staff?
  • A: We do not want to hire staff specifically to build this system so chosen components and system design should be easily accessible.

With answers to our basic questions we will now look at each of the pipeline components and give a brief overview of what technology we decided on at each step and why.

Producers

Beamery makes use of a number of data stores; however, the primary one we are interested in for event collection is MongoDB. MongoDB has had change streams since version 3.6, which make it possible to listen for data changes in a very easy and lightweight way. This takes care of the production of data change events, but we still need something to watch the change stream.
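
For illustration, here is a minimal sketch of watching a change stream with the official MongoDB Node.js driver. The connection string, database, and collection names are placeholders, not the real configuration:

```typescript
import { MongoClient } from "mongodb";

// Minimal change-stream watcher. Connection details and collection
// names are placeholders.
async function watchChanges(): Promise<void> {
  const client = new MongoClient("mongodb://localhost:27017");
  await client.connect();

  const collection = client.db("app").collection("contacts");

  // Requires MongoDB 3.6+ and a replica set; each change document
  // describes an insert, update, delete, etc. on the collection.
  const changeStream = collection.watch();
  for await (const change of changeStream) {
    console.log("change event:", change.operationType, change);
  }
}

watchChanges().catch(console.error);
```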

Collectors

We initially looked at a custom service that would watch the MongoDB change stream and then publish into our transient storage; however, after looking around we settled on Kafka Connect along with the connector for MongoDB. Using Connect frees us from the trouble of building and maintaining a collector, and also opens up a wealth of other options thanks to the Kafka Connect ecosystem of connectors (plugins).
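
As a rough sketch of what registering such a connector involves, the snippet below posts a MongoDB source connector configuration to a Kafka Connect worker's REST API. The host names, database, collection, and topic prefix are placeholder assumptions, and the exact properties available depend on the connector version:

```typescript
// Register a MongoDB source connector with a Kafka Connect worker.
// URLs, names, and connection details below are placeholders.
const connectorConfig = {
  name: "mongo-events-source",
  config: {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "connection.uri": "mongodb://mongo:27017",
    "database": "app",
    "collection": "contacts",
    "topic.prefix": "events", // change documents land on topics with this prefix
    "tasks.max": "1",
  },
};

async function registerConnector(): Promise<void> {
  const response = await fetch("http://kafka-connect:8083/connectors", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(connectorConfig),
  });
  console.log("Connect API responded with", response.status);
}

registerConnector().catch(console.error);
```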

Given that we have chosen to make use of Kafka Connect, it rather gives the game away as to which transient store we will be using.

Transient Storage

Before we settled on using Kafka Connect as a collector we looked at different message brokers to act as our transient data store. We already make heavy use of RabbitMQ and Google's Pub/Sub; however, neither of these quite fits our requirements for event processing. We wanted to make sure that our data in flight was durable in the case of an outage. Apache Kafka is perfect for this kind of resilience as it persists data to disk and makes it relatively trivial to replay data from a given point. Kafka is also highly scalable, which fits in with our need to grow over time and to handle sudden upticks in data volume.

We already had Kafka running in the company for some very specific tasks so it became the obvious choice along with the selection of Kafka Connect to expand the use of Kafka at Beamery.

Consumers

If we are honest, the consumer selection was one of the harder decisions to make. There are so many different options out there for reading data from Kafka that we were spoilt for choice. After eliminating various options we ended up with two possibilities: writing a custom Kafka consumer to do exactly what we wanted, or using an off-the-shelf component in ksqlDB. ksqlDB is very interesting as it allows interaction with Kafka streams using an SQL dialect; we did some trials and the results were good. However, looking back at our answers and requirements, we wanted an easy route into working with the Kafka data for all of our engineers without a long learning process on something totally new. For this reason we felt that it wasn't the right time for ksqlDB and opted for a custom Kafka consumer written in TypeScript.
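
A custom consumer of this kind might look something like the sketch below, which uses the KafkaJS client as one possible library choice; the broker addresses, topic name, and transform logic are placeholder assumptions:

```typescript
import { Kafka } from "kafkajs";

// Minimal consumer: read raw change events from a topic, transform
// them into a common format, and hand them off for storage.
// Broker addresses and topic names are placeholders.
const kafka = new Kafka({ clientId: "event-consumer", brokers: ["kafka:9092"] });
const consumer = kafka.consumer({ groupId: "event-pipeline" });

async function run(): Promise<void> {
  await consumer.connect();
  await consumer.subscribe({ topic: "events.app.contacts", fromBeginning: false });

  await consumer.run({
    eachMessage: async ({ topic, partition, message }) => {
      const raw = JSON.parse(message.value?.toString() ?? "{}");

      // Transform/enrich into the common event format before storage.
      const processed = {
        eventId: raw._id ?? message.offset,
        type: raw.operationType,
        processedAt: new Date().toISOString(),
        payload: raw,
      };

      console.log(`processed ${topic}[${partition}]@${message.offset}`, processed);
      // ...write `processed` to the data warehouse and the raw event to the data lake.
    },
  });
}

run().catch(console.error);
```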

We will be visiting ksqlDB again in the future as it has a lot of potential once it has matured a little more.

Data Lake

One thing we wanted to be sure of was to future-proof our solution. If we want to generate different insights by manipulating our raw data in different ways, we need to make sure we still have access to that data. This is where the data lake option comes in. A data lake is essentially a collection of raw data. It can be a collection of files in long-term cold storage (something like Amazon's S3 or Google's GCS buckets) or another storage system. As we do not plan on needing fast access to this data we will be making use of object storage in buckets.

We opted to add an option to our Kafka consumer to tag each raw event with an identifier relating to its processed version and then store it, so we always have a reference back.
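
As a sketch of that idea, the snippet below writes a raw event to a GCS bucket keyed by the identifier of its processed counterpart, using the @google-cloud/storage client. The bucket name and object layout are assumptions:

```typescript
import { Storage } from "@google-cloud/storage";

// Store the raw, unprocessed event in a cold-storage bucket, named
// after the identifier of its processed version so we can trace back.
// Bucket name and object path layout are placeholders.
const storage = new Storage();

async function archiveRawEvent(processedId: string, rawEvent: unknown): Promise<void> {
  const objectPath = `raw-events/${new Date().toISOString().slice(0, 10)}/${processedId}.json`;

  await storage
    .bucket("example-event-data-lake")
    .file(objectPath)
    .save(JSON.stringify(rawEvent), { contentType: "application/json" });
}
```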

Data Warehouse

The data warehouse is the final resting place of our processed event data. We were not 100% sure initially which database we wanted to use, but we knew that we wanted it to be an RDBMS with SQL as its primary query language. The options came down to PostgreSQL (GCP Cloud SQL), CockroachDB, and GCP's BigQuery and Cloud Spanner. We knew that we wanted this service to be managed rather than putting additional load on our SREs, so having an easy way to scale out was very important, along with the ability to monitor health and performance.

All of the options listed here are more than capable, but with an eye on the future, ever larger data sets, and the need for good write performance under increasing load, we eliminated PostgreSQL. PostgreSQL can be scaled in this way, but not in a managed offering and not without some additional overhead.

Given that we already use BigQuery for a number of in-house things, we did some testing and found that its limitations around streaming inserts and the streaming buffer caused big headaches at higher throughputs. BigQuery is great for loading large batches of data, but for a large volume of smaller inserts we found it lacking for our use case.

This left us with CockroachDB and Spanner. Cockroach Labs offers a managed version of CockroachDB (on AWS and GCP), and CockroachDB itself has some very nice features: multi-master operation, regional data pinning, use of the PostgreSQL wire protocol (so the client driver support is there), plus connector support for Kafka Connect should we wish to make use of it.

Having said all that, in the end we settled on Spanner for a number of reasons. It's a managed solution that is very easy to provision and scale with tools like Terraform, it has very strong consistency, it's easy to trigger encrypted backups, and as a bonus it can be connected to GCP's Dataflow product for any additional processing we might want to do.
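
To illustrate the final hop, here is a hedged sketch of writing a processed event into Spanner with the @google-cloud/spanner client; the project, instance, database, table, and column names are all placeholder assumptions:

```typescript
import { Spanner } from "@google-cloud/spanner";

// Write a processed event into the warehouse table.
// Project, instance, database, table, and column names are placeholders.
const spanner = new Spanner({ projectId: "example-project" });
const database = spanner.instance("events-instance").database("events-db");

async function storeProcessedEvent(event: {
  eventId: string;
  type: string;
  occurredAt: string;
  payload: Record<string, unknown>;
}): Promise<void> {
  await database.table("processed_events").insert([
    {
      EventId: event.eventId,
      EventType: event.type,
      OccurredAt: event.occurredAt,
      Payload: JSON.stringify(event.payload),
    },
  ]);
}
```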

Conclusion — The Final Architecture

The following diagram represents the flow of data through our new data pipeline.

The Beamery event data pipeline flow.

This flow on the whole seems quite simple, and that is what we should be aiming for. It is important to go through the whole process we have detailed to make sure we do not end up with an architecture that is overly complex and tries to do too much. Complex systems can become difficult to maintain and debug over time; simplicity is key to allowing a system to scale easily.

The main takeaway is that for any system build we should be questioning our decisions at every step to make sure what we are building does not become convoluted and difficult to maintain. Keep it simple.
