Make your own pixel tracker with AWS Kinesis Firehose and AWS Redshift

AWS Kinesis Firehose and AWS Redshift are some of my favorite features of any cloud platform. This stack lets you track millions of events per day with very little devops work. This guide is an in-depth, devops-oriented walkthrough for deploying this analytics stack to track pixels, though it can be extended to whatever you want.

What are tracking pixels?

Tracking pixels, also called web beacons, are images that send data to an analytics server via query strings. They could look something like:

<img src="https://analytics.example.com/1x1.gif?event=load" hidden>

This pixel tracks a “load” event, which fires when the page is loaded by the browser. The browser requests the image when it reaches this point in the HTML, but the image is hidden on the page because the image itself is not important; the HTTP call is. By using an <img> element instead of JavaScript, we can be confident the browser actually loads the image.

Query string parameters are sent with the image request to track various attributes. In this example, we have an event parameter that identifies the event, in this case load. In adtech, there are what we call “macros”, which are essentially variables for query strings:

https://analytics.example.com/1x1.gif?event=[EVENT]

In this example, [EVENT] is a macro for the event parameter. The URL is passed to partners, who replace the macro with whatever is appropriate, so a partner can track as many events as they want with a single URL. This is much easier than sending over hundreds of pixels, one for every event. Here’s an example of how a partner could fire this pixel:

function callWebBeacon(url, event) {
  // replace the [EVENT] macro with the actual event name
  url = url.replace(/\[EVENT\]/ig, event)
  // fire the pixel: a hidden image pointed at the tracking URL
  const img = new Image()
  img.src = url
  img.style.display = 'none'
  document.body.appendChild(img)
}

If you’re working with JSON tracking data or redirects, this guide still applies; you really only have to change the web server, not the rest of the stack.

What is AWS Kinesis Firehose and Redshift?

Redshift is a column-oriented database based on PostgreSQL 8.0, trading a lot of PostgreSQL’s flexibility for sheer speed. It’s tuned for large analytics workloads, and because it speaks SQL, writing queries and loading data is easy, especially if you already use SQL elsewhere. Best of all, there is almost no maintenance for your Redshift clusters.

Kinesis Firehose is a managed data stream that loads data into S3 and then Redshift, essentially handling the entire ETL process. Without Kinesis Firehose, loading a lot of data into Redshift would be a pain in the ass, as there are a lot of restrictions you would have to work around.

With this ETL pipeline, your data is saved to S3 and loaded into Redshift at a minimum of 1 minute intervals. It’s not real-time streaming, but you might be able to use AWS Kinesis Analytics for that.

Creating your Redshift database

Creating a Redshift cluster on AWS is similar to creating a database on AWS RDS. One of the first things to consider is whether to use SSD or HDD instances. SSD instances are a lot faster but store a lot less data per price point. I’ve found HDD instances to be significantly slower; my queries seem to be bottlenecked by disk speed rather than CPU or memory. I’d really only recommend HDD instances if you don’t run queries very often or you are cost constrained.

One important question when you make your cluster is how many instances to deploy. If you make a multi-node cluster, you can’t convert it to a single-node cluster. This means you can’t go from 4 small instances to 1 big instance; you’d have to go to 2 big instances. However, some people have said that if you define your distribution and sort keys well, multiple small instances are actually faster than fewer big instances.

Be sure to set an appropriate VPC for your cluster and create a CloudWatch alarm for when your cluster runs out of space. I always keep my cluster at less than 50% capacity. I’ve hit issues where my queries don’t fit in memory and start spilling to disk, and then I run out of disk space, slowing my servers to a crawl.

Creating your Redshift table

The first step is designing your schema. Redshift schemas are similar to PostgreSQL schemas except when it comes to indexes:

  • There are no unique or sparse indexes.
  • Consider your distribution key carefully.
  • Consider your sort key carefully. Use a compound or interleaved sort key if you know your queries already.

Because Redshift does not have unique indexes, you could end up with duplicate data. You could query around duplicates, but what might be better is to create a job that looks for duplicates and deletes them. That won’t be covered by this guide as it depends on your data structure.

Your distribution key determines how data is distributed across all your Redshift servers. It can act as a filter: if Redshift knows all the relevant data is on one server, it can run the query on just that server. However, there is an issue with skew: if you choose a bad distribution key, data could be skewed toward certain servers, and those servers might run out of disk space faster.

Also consider your queries: if you frequently filter on just a few values of a column, you might not want to use it as the distribution key, because your queries would only run on a few of your servers. You’d end up not utilizing the whole cluster! This is less of an issue if you use a few gigantic servers rather than a lot of tiny ones.

The difference between a compound and an interleaved sort key is that a compound key is a prefix index whereas an interleaved key is not. If you have a bunch of mutually exclusive filters, use an interleaved sort key. If you always query with the same few filters, set those as the compound sort key.
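
As a rough sketch, here is how the two styles look in table DDL; the table names below are only illustrative, not part of this guide’s schema:

-- compound sort key: rows are sorted by received_at first, then event,
-- so it shines when queries always filter on the leading column(s)
CREATE TABLE events_compound (
  id VARCHAR(36) NOT NULL,
  event VARCHAR(20) NOT NULL,
  received_at TIMESTAMP NOT NULL
)
COMPOUND SORTKEY (received_at, event);

-- interleaved sort key: gives each column equal weight,
-- useful when event and received_at are filtered on independently
CREATE TABLE events_interleaved (
  id VARCHAR(36) NOT NULL,
  event VARCHAR(20) NOT NULL,
  received_at TIMESTAMP NOT NULL
)
INTERLEAVED SORTKEY (received_at, event);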

Here’s the simplest table we can make for our event table. We use a simple distribution and sort key:

CREATE TABLE events (
  id VARCHAR(36) NOT NULL PRIMARY KEY,
  event VARCHAR(20) NOT NULL DISTKEY,
  received_at TIMESTAMP NOT NULL SORTKEY
);

In our example, we use event as our distribution key, effectively giving it an index.

Creating your Kinesis Firehose

Time to create your Kinesis Firehose! The specific Kinesis Firehose options you should be aware of are:

  • S3 Compression: GZIP, so that you store the data compressed, saving space and money
  • Error logging: Enabled

Your COPY options should be as follows:

FORMAT AS JSON 'AUTO' GZIP TRUNCATECOLUMNS MAXERROR 100000

Let’s explain these options:

  • FORMAT AS JSON 'AUTO': assuming you are sending JSON, this maps each JSON field to the matching table column.
  • GZIP: because we are storing gzipped data, this flag tells Redshift to gunzip the data before reading it.
  • TRUNCATECOLUMNS: if text is too long for a column, truncate it instead of erroring.
  • MAXERROR 100000: continue loading data until it hits this number of errors. Otherwise, a single bad row would abort your entire load!
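
Put together, the COPY command that Firehose issues on your behalf looks roughly like the following. The bucket, manifest path, and IAM role are placeholders; Firehose builds the real command for you from the options above:

COPY events
FROM 's3://your-firehose-bucket/manifests/your-manifest'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/firehose-redshift-role'
MANIFEST
FORMAT AS JSON 'auto' GZIP TRUNCATECOLUMNS MAXERROR 100000;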

Tracking load errors

Sometimes you’ll see that tracking worked and the data was saved in S3, the SQL that loads data into Redshift ran, but your table doesn’t have new data. This is probably because there were errors in the data itself! For example, if you have a NOT NULL column but try to load a JSON object missing that field, that row will fail to load.

There are a couple of ways to look at failed loads. One way is to use the following query to see the latest load errors:
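
A minimal sketch against the stl_load_errors system table; pull in whichever of its columns you find useful:

SELECT starttime, filename, line_number, colname, err_reason
FROM stl_load_errors
ORDER BY starttime DESC
LIMIT 10;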

However, I personally just look at the AWS UI. You could look at the Kinesis Firehose’s logs or go to individual loads in the Loads tab of the Redshift console. These are all ways to see why your data isn’t being loaded.

The reason we load data with MAXERROR 100000 is so that good rows are still loaded when others might have errors. Otherwise, none of your data would be loaded at all. But you might want to disable this option while developing your ETL so that you can more easily track errors.

Creating your tracking server

Your tracking server could be as simple as the following:
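
The sketch below uses Node’s built-in http module and the AWS SDK. The region, stream name, and port are placeholder assumptions, and the batching that the bulk-insert module handles (discussed below) is written out inline so the example is self-contained:

const http = require('http')
const { parse } = require('url')
const AWS = require('aws-sdk')

// assumed region and delivery stream name
const firehose = new AWS.Firehose({ region: 'us-east-1' })
const STREAM_NAME = 'events'

// a 1x1 transparent GIF to respond with
const pixel = Buffer.from('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7', 'base64')

let buffer = []
let timer = null

// send everything buffered so far to Firehose in one putRecordBatch() call
function flush() {
  clearTimeout(timer)
  timer = null
  if (!buffer.length) return
  const records = buffer.map((data) => ({ Data: data }))
  buffer = []
  firehose.putRecordBatch({
    DeliveryStreamName: STREAM_NAME,
    Records: records
  }, (err) => {
    if (err) console.error(err)
  })
}

// buffer up to 500 records and flush at most every 300ms
function track(data) {
  buffer.push(JSON.stringify(data) + '\n')
  if (buffer.length >= 500) return flush()
  if (!timer) timer = setTimeout(flush, 300)
}

const server = http.createServer((req, res) => {
  // every query string parameter becomes a field in the JSON we load
  track(parse(req.url, true).query)
  res.writeHead(200, { 'Content-Type': 'image/gif' })
  res.end(pixel)
})

server.listen(process.env.PORT || 3000)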

This is the bare minimum tracking server you can write. In a production application, you’d probably want to add validations and transformations on the query string before inserting it into the database. For example, you’d probably want to track the IP address and user-agent string.

Of course, you’ll probably want to use an actual web framework to do the above and add more features. However, for the purposes of this guide, I will avoid adding complications and opinions. You’d also probably want real routing instead of accepting any route as a pixel.

bulk-insert is a module I wrote for batching requests like this. It buffers data, in this case up to 500 records at a time, and flushes at a maximum interval, in this case 300ms. You want this to avoid sending one HTTP request to Firehose for every tracked event, which would add a lot of overhead and hurt performance.

To batch our requests, we use putRecordBatch(). Records are strings, so we JSON.stringify() every track call, which is just a JSON object with a trailing \n. To save money, you could combine multiple records into a single record (\n-separated JSON objects), but that is beyond the scope of this guide.

In production, you would want to handle graceful shutdowns, otherwise you might lose tracking data. Fortunately, you won’t need much more than a second to clean up in-flight requests, and many platforms, such as Azure Web Apps or AWS Elastic Beanstalk, handle graceful shutdowns for you already.
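
A rough sketch of what that cleanup could look like, reusing the server and flush() function from the sketch above (the exact signal depends on your platform):

// stop accepting new connections, then flush anything still buffered
process.on('SIGTERM', () => {
  server.close(() => {
    flush()
    process.exit(0)
  })
})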

Calling your pixels

If you need to call your pixels with JavaScript, here’s the simplest way (using browserify/webpack):

import { stringify } from 'query-string'

function track(data) {
  // serialize the tracking data into the query string
  const img = new Image()
  img.src = 'https://analytics.example.com?' + stringify(data)
  img.style.display = 'none'
  document.body.appendChild(img)
}

Maintaining your database

It’s important to make sure you chose the right distribution key. One way to verify your choice is to check the skew:
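
A minimal sketch using the svv_table_info system view, whose skew_rows column is the ratio of rows on the largest slice to the smallest; values far above 1 suggest a poor distribution key:

SELECT "table", diststyle, skew_rows
FROM svv_table_info
ORDER BY skew_rows DESC;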

Over time, you’ll notice your queries getting slower and slower. One of the reasons is that your data, probably the newer data, is not stored sorted. To check how much of your data is unsorted:
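
Again via svv_table_info, which reports the percentage of unsorted rows per table:

SELECT "table", tbl_rows, unsorted
FROM svv_table_info
ORDER BY unsorted DESC;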

If your tables have more than 5% unsorted rows, then you should vacuum your database, which will sort your rows and reclaim deleted space. Create a script that runs the following SQL command on your Redshift server:

VACUUM FULL;

I vacuum my databases nightly, which only takes an hour or so. If you only vacuum on the weekend, it might take a few hours, so it depends on your business requirements as vacuuming may slow your queries.

Manage query workloads

You’re probably going to have a lot of different types of queries hitting your server all the time:

  • Data loading from Kinesis and other sources
  • Querying data when performance does not matter
  • Querying data as a user
  • Vacuuming as the super user

When you have so many types of queries, you may want to implement workload management so that your queries run in the priorities you prefer and so you can protect your server against contention.

For example, your workload management might look like:

  • User Group: loader, Concurrency: 2, Memory: 5%
  • User Group: cron_jobs, Concurrency: 2, Memory: 20%
  • User Group: data_scientists, Concurrency: 4, Memory: 75%
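
In the cluster’s parameter group, that setup corresponds roughly to a wlm_json_configuration like the sketch below. Treat the field names and values as approximate and check the current parameter reference before applying anything; the final entry is the default queue:

[
  { "user_group": ["loader"], "query_concurrency": 2, "memory_percent_to_use": 5 },
  { "user_group": ["cron_jobs"], "query_concurrency": 2, "memory_percent_to_use": 20 },
  { "user_group": ["data_scientists"], "query_concurrency": 4, "memory_percent_to_use": 75 },
  { "query_concurrency": 5 }
]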

We allocate only 5% of memory to loading data because loads don’t need much memory and we don’t mind if they’re slow. Similarly, cron jobs run frequently, so we don’t mind if they’re slow either.

We then allocate 75% of memory to the queries data scientists run because they might be actively waiting for their results. Depending on how often your data scientists run queries, this may be too much or too little memory. The great part about WLM is that even if the servers are running too many loads or too many cron jobs, data scientists will still be able to work because a large portion of the servers’ memory is dedicated to them.

You should not be running all your queries as the superuser. If, for some reason, too many queries hit the server at once, some of them, such as data loads, won’t run, and you’ll stop loading data fast enough. Or perhaps your data scientists wouldn’t be able to work at all. I, however, only set up WLM on my production servers and don’t bother with my staging servers.
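
The user groups referenced above are ordinary database groups, so creating them and their users is plain SQL. The usernames and password below are placeholders:

CREATE GROUP loader;
CREATE GROUP cron_jobs;
CREATE GROUP data_scientists;

-- the user your Firehose COPY connects as
CREATE USER firehose PASSWORD 'Change1me' IN GROUP loader;
-- an individual analyst login
CREATE USER jane PASSWORD 'Change1me' IN GROUP data_scientists;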

Whether to use AWS API Gateway

What’s cool about API Gateway is that you could set it up as an AWS Firehose proxy. However, I’ve decided not to use this for a couple of reasons:

  • You cannot add data validation or custom logic when you use API Gateway and Kinesis Firehose alone.
  • You could add logic with AWS Lambdas, but it will actually cost you a lot more money than if you just use actual servers.

Data validation and formatting are very important when your data is dynamic and could be sent by people other than you, which is exactly the case with adtech pixels fired by partners. AWS Lambda can be very expensive for this use case. One reason is that Lambdas are priced per 100ms, but these API calls take far less time than that, so you are just wasting money. Additionally, a server that batches requests is more efficient than one Lambda invocation per event.

I personally prefer throwing errors on my server if my data is bad. During testing, I will know I am sending bad data, and my production data will always be “clean”, making my queries simpler and faster. However, if you want to do data validations in your SQL queries, then consider just piping data straight into Kinesis Firehose using API Gateway.

Conclusion

AWS Kinesis Firehose and Redshift are a pretty straightforward and cost-effective way to build an analytics ETL pipeline for your company. There’s some devops work, cron jobs, and schema design you have to do to deploy a Redshift cluster successfully, but afterwards, all you have to do to maintain your servers is add more instances to your cluster!
