How we keep track of photo downloads at Unsplash
Whether you’re a photographer on Unsplash or you just use it for its resources and community, you’ve probably noticed that we report the number of downloads each photo gets, as well as the evolution of that number recently. If you’re a photographer, you’ve also seen the aggregated count of downloads across all your photos. These numbers usually refresh every 10 minutes.
Question is, do you know how this happens?
The Unsplash API is one of the keys
First, you need to know what a “download” really represents. To get there, I’m presenting you our API. The Unsplash API is open to every developer and can serve any application that requires it with different services like getting photos (random, search, recent …) or downloading them. What’s important is that the API not only serves 3rd party applications but it also serves our website unsplash.com, our different apps (iOS, Android), our browser extension (Unsplash Instant) and our TV apps.
This means that when a photo gets downloaded, anywhere in the Unsplash ecosystem, in any app or website that uses our API, the API is alerted. We work with API applications to fire a request to our API whenever a user performs a download action. The Unsplash API handles about 8 photo downloads per second across all of the ecosystem. That’s 60,000 different photos being downloaded 700,000 times every day.
Whenever the API delivers a photo for a download, it fires 2 requests to our data pipeline.
But why 2 requests?
The data pipeline: a pseudo lambda architecture
The first request goes to our classic data pipeline: the request hits a server which produces a log. Once a day, all these server logs are parsed to extract the data which is then stored in our data warehouse. We call this data setup the “batch pipeline”.
The second request goes to our stats server: the stats server receives the request and lists it in a Redis index. Every 10 minutes, an ETL job extracts the data from Redis and stores it in our warehouse.
That second setup is what makes us able to update the downloads count for each photo and the aggregation per photographer every 10 minutes. We wrongfully call it the “realtime pipeline”. The problem is that this pipeline is not as reliable and is much more expensive than the batch pipeline.
To solve the lower reliability concern, at the end of the day, the batch pipeline overwrites the data from the realtime pipeline, repairing some rare but possible missing downloads. From then, a new day starts on a fresh base: the counters continue to follow the 10 minute update rule, until the next batch processing runs and replaces the values for the day. And it goes on, and on, and on …
We call it a pseudo lambda architecture because in a classic lambda architecture, only one request is sent and the server is the one splitting that request and sending it into the 2 pipelines (realtime and batch).
We could manage to make our realtime pipeline be … realtime. We could do this by using some technologies like ElasticSearch but we made a choice to use a smaller number of services to simplify our architecture. This is not definitive and we might move to an actual realtime pipeline at some point, when we’ll have a true need for it and the time that’s necessary to implement it.
What follows is for the techies. I’ve mentioned “a smaller number of services”, let me take you inside the tech stack that allows us to track photo downloads.
What’s under the hood?
Let’s not talk about our API because I wouldn’t be able to honor it as it should be. I’ll just mention that it’s mainly resting on Ruby. Props to Bruno Aguirre, Aaron Klaassen, Roberta Doyle and Luke Chesser for #TeamAPI.
Instead, let’s discuss a bit about our data architecture.
The batch pipeline
Our batch pipeline is powered by Snowplow. It’s an open-source data pipeline that lets you collect, process and store data. What you collect, how you process it and where you store it is entirely up to you. It’s a very flexible tool that makes you the only owner of your data.
We deployed the Snowplow pipeline in Amazon Web Services (AWS).
An Elastic Beanstalk instance is the event collector. It’s a web server that logs any incoming request and stores these logs in Amazon S3. The API sends it a request anytime a download happens in the Unsplash ecosystem. The collector collects all sort of events, currently logging 350 requests per second at peak times.
From there, an EC2 instance (that we also use for proxy purposes) runs a daily cron job that fires up a brand new Elastic MapReduce cluster responsible for processing the logs in multiple steps:
- Pulling the collector logs from S3
- Enriching the pulled logs with custom processes (string splitting or encoding/decoding for example).
- Shredding the enriched data to format it as a JSON
- Loading the JSON formatted data into our warehouse
This EMR job from Snowplow also saves backups of the state of our data after each step. If something fails, we can always reprocess it.
The (almost) final destination for our data is our warehouse: an Amazon Redshift cluster. The cluster contains pretty much all our website, apps and API activity. For the AWS geeks, it’s a Redshift cluster currently made of 21 dc2.large nodes for a capacity of 3.3TB of compressed data.
The realtime pipeline
The “realtime pipeline” is more of a homemade solution.
A web server living on Heroku receives the requests sent by the API when there’s a download, transforms them into readable events and logs them in Redis.
Every 10 minutes, a scheduler process (living on Heroku as well) enqueues a “handle realtime downloads” task in a Redis queue. When our worker process finds time, it’s going to take the task out of the queue and process it. That whole scheduler/queue/worker setup is built in NodeJS with node-resque.
In that case, what the worker does is pulling the downloads out of Redis and inserting them straight into our warehouse, the same Amazon Redshift cluster we mentioned previously.
If you ask me “why not send the downloads directly there instead of buffering in Redis?”, I’ll answer that Redshift doesn’t couple well with really frequent inserts. Bulking the inserts also helps lowering the load on the cluster that needs to perform well on a lot of other tasks.
The rest is pretty straight forward. All our data, whether it’s “realtime” or batch processed, lives in our warehouse, in different tables. By querying our warehouse, we can tell exactly how many times each photo has been downloaded and we can aggregate these counts by user.
The only issue remaining is that our warehouse does not perform well enough to answer the thousands of requests for download stats we get each hour fast enough for a good user experience. So we need to move the data … once again.
For a faster delivery of stats, we use a PostgreSQL database living in Heroku. We gave it the really fancy name of “stats database”. Its total size is about 40GB but it holds more than just downloads counters. An ETL job that runs every 10 minutes pulls the downloads from our warehouse, transforms them in a performant format and pushes them into our stats database. This database is the leader of a “read-only” follower that our API queries to get up-to-date stats. Our website asks our API for the most recent stats and displays it for you to see.
And boom! This is how we track and update photo downloads at Unsplash.