The data stack at Unsplash

Timothy Carbone · Unsplash Blog · Sep 14, 2018 · 7 min read

Here at Unsplash, none of our data processes are built into any of our main products (the website, the API, etc.). They all live in a dedicated project that we call Unsplash-data. Like any other product, Unsplash-data uses different third-party services to manage, enrich, store, visualize and distribute data.

Let me take you on a tour of our data stack.

Unsplash data stack interactions (main data pipeline in solid black)

First of all, we need to collect data from all our different applications.

Data collection

Snowplow

Photo by Dan Cook on Unsplash

Snowplow is an open-source data pipeline that offers several solutions for collecting data from different platforms.

It’s ideal for Unsplash because we have a lot of very different products. It includes all the tools to set up a pipeline (either batching events or real-time) to collect, enrich and store your data, all within your own infrastructure.

Our Snowplow pipeline runs on AWS. It contains:

  • An Elastic Beanstalk web server that receives events and produces logs (see the sketch after this list)
  • A set of S3 buckets to store the web server logs and back up the different stages of our data processing
  • Elastic MapReduce clusters that are fired up to process, enrich and format the logs we collect
  • A Redshift warehouse to store the processed data (accessed via SQL)
  • An EC2 box that schedules the data processing and also works as a proxy
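To give a concrete feel for the collection step, here’s a minimal sketch of a page-view event hitting the collector through Snowplow’s tracker protocol. The collector URL and app ID are placeholders, and in practice you’d use one of Snowplow’s official trackers rather than raw HTTP:

```python
import requests

# A Snowplow collector accepts tracker-protocol events on its GET endpoint.
# The collector URL and "aid" below are placeholders for illustration.
COLLECTOR = "https://collector.example.com/i"

params = {
    "e": "pv",                       # event type: page view
    "url": "https://unsplash.com/",  # page URL
    "page": "Unsplash",              # page title
    "aid": "unsplash-web",           # hypothetical application id
    "p": "web",                      # platform
    "tv": "py-0.1.0",                # tracker version string
}

resp = requests.get(COLLECTOR, params=params)
resp.raise_for_status()  # the collector answers 200 and logs the raw event
```

From there, the logs flow through the S3 and EMR stages described above and land in Redshift.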

Thanks to Snowplow, we completely own our data. We’re free to modify, archive, derive or replay it exactly the way we want.

Data storage

The data processed by Snowplow ends up in our main data warehouse, a Redshift cluster.

Amazon Redshift

Redshift is a warehouse that you can query with SQL (a limited dialect of PostgreSQL with fewer/different functions). Read and write operations are free: you pay only for the storage volume and computing power you want available. It’s an easily scalable solution and, like most warehouses, it uses columnar storage. In short, this means it can scan millions or billions of rows very quickly as long as you limit the number of columns your query reads.

The ability to query it via SQL makes it an awesome solution when it comes to developing around it and integrating it with our other systems and third-party services.
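Because Redshift speaks the PostgreSQL wire protocol, any standard Postgres client can query it. Here’s a minimal sketch with psycopg2 (host, credentials and the date window are placeholders; atomic.events is the table Snowplow loads into) that follows the columnar rule of thumb by only touching the columns it needs:

```python
import psycopg2

# Redshift speaks the Postgres protocol, so a regular Postgres driver works.
# Host and credentials are placeholders.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,  # Redshift's default port
    dbname="analytics",
    user="data_reader",
    password="...",
)

with conn.cursor() as cur:
    # Columnar storage: this reads 2 columns out of a very wide events
    # table, so it stays fast even over billions of rows.
    cur.execute("""
        SELECT event_name, COUNT(*)
        FROM atomic.events
        WHERE collector_tstamp > dateadd(day, -7, getdate())
        GROUP BY event_name
    """)
    for event_name, n in cur.fetchall():
        print(event_name, n)
```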

We also have other data stores for other uses.

Google BigQuery

BigQuery is another warehouse we use, more as an intermediate store than as final data storage. It’s a destination for logs but also a data source for the rest of the pipeline: we read from it, run calculations and store the results elsewhere. It’s where our own API logs and all the photo views end up before being processed.

What’s interesting about it is that the computing power is dynamic and handled by the system itself. You don’t have to worry about whether the BigQuery cluster is going to be powerful enough: it will be. The price you pay is based on the volume of data you read and write, while storage itself is cheap ($0.01–0.02 per GB). The pricing model is the complete opposite of Redshift’s.
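Reading from BigQuery and shipping the results elsewhere might look like this sketch with the google-cloud-bigquery client (the project, dataset and table names are made up for illustration):

```python
from google.cloud import bigquery

# BigQuery allocates the computing power by itself; you're billed for the
# bytes the query scans. Project/dataset/table names are hypothetical.
client = bigquery.Client(project="unsplash-data")

query = """
    SELECT photo_id, COUNT(*) AS views
    FROM `unsplash-data.logs.photo_views`
    WHERE DATE(logged_at) = CURRENT_DATE()
    GROUP BY photo_id
"""

rows = client.query(query).result()  # runs the job and waits for it
daily_views = {row.photo_id: row.views for row in rows}
# ...the result is then stored elsewhere, e.g. in our PostgreSQL database.
```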

So depending on our use case, we might use Redshift over BigQuery or the other way around to lower costs.

Heroku PostgreSQL

Another data source for the Unsplash-data system is simply the product database, hosted on Heroku, where all the photos, user accounts, collections, API applications, etc. are stored. The data system reads from it to keep an up-to-date copy of certain data models, but also to run calculations or transform the data into a more readable format.

The data system has its own PostgreSQL database, also hosted on Heroku. Everything data-related is stored there. Whenever the API needs to access data like views or downloads, it queries this database.
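As an illustration, such a stats lookup could look like the following sketch (the table and columns are invented; DATABASE_URL is the standard Heroku config variable):

```python
import os

import psycopg2

# The data system's own Postgres database on Heroku. The photo_stats
# table and its columns are hypothetical.
conn = psycopg2.connect(os.environ["DATABASE_URL"])

def photo_stats(photo_id):
    """Return (views, downloads) for a photo."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT views, downloads FROM photo_stats WHERE photo_id = %s",
            (photo_id,),
        )
        return cur.fetchone()
```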

Data Enrichment

Once all the data sources are plugged in and the data is available in our PostgreSQL database, we use different services to enrich it.

Photo by Harry Quan on Unsplash

Enriching the data means collecting more data purely based on the data you already possess. It’s like deducing new data.

For example, you can generate tags for a photo by using an AI service and then store the generated tags with your photo. You “enriched” your photo with tags.

Google Vision

Google Vision allows all kinds of detections on your photos. We use it to extract colours, detect landmarks and evaluate the content to understand whether it’s NSFW or violent.

A nice example: we used the enriched data from Google Vision to build Unsplash Landmarks.
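A sketch of that kind of call with Google’s Python client (assuming a recent google-cloud-vision version; the image URL is a placeholder):

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()
# The image URI is a placeholder for one of our hosted photos.
image = vision.Image(
    source=vision.ImageSource(image_uri="https://example.com/photo.jpg")
)

# Dominant colours, landmarks and NSFW/violence likelihoods.
props = client.image_properties(image=image).image_properties_annotation
landmarks = client.landmark_detection(image=image).landmark_annotations
safety = client.safe_search_detection(image=image).safe_search_annotation

for c in props.dominant_colors.colors:
    print(c.color.red, c.color.green, c.color.blue, c.score)
print([l.description for l in landmarks])
print(safety.adult, safety.violence)  # likelihood enums, e.g. VERY_UNLIKELY
```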

Cloudsight

Cloudsight offers a nice feature called “whole scene description”: an API that analyses the content of your photo and formulates a sentence describing it. We grab and store that sentence, mainly for accessibility and to provide alternative text if the image has a loading problem.
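The flow is roughly: submit the photo, poll until the description is ready, then store the sentence. The endpoint, auth scheme and field names in this sketch are assumptions for illustration; check Cloudsight’s docs for the real contract:

```python
import time

import requests

API = "https://api.cloudsight.ai/v1/images"          # assumed endpoint
HEADERS = {"Authorization": "CloudSight <api-key>"}  # assumed auth scheme

# Submit a photo by URL; the field names are assumptions.
job = requests.post(API, headers=HEADERS, json={
    "remote_image_url": "https://example.com/photo.jpg",
    "locale": "en",
}).json()

# Poll until the scene description is ready, then store it as alt text.
while True:
    result = requests.get(f"{API}/{job['token']}", headers=HEADERS).json()
    if result.get("status") == "completed":
        print(result["name"])  # e.g. "person standing on a mountain at dusk"
        break
    time.sleep(1)
```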

Amazon Rekognition

Rekognition works a bit like Google Vision, except we only use it to generate tags and keywords. Rekognition analyses the content of the photo and provides a set of keywords, each with a confidence level, that describe your photo. This is one of the first building blocks of our search engine.

We picked Amazon Rekognition over its competitors because we found its confidence levels well calibrated, and the list of keywords offers a solid, conservative base on which we can safely build an accurate search engine.
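With boto3 the call looks roughly like this (the bucket and key are placeholders); every label comes back with the confidence score we rely on:

```python
import boto3

rekognition = boto3.client("rekognition", region_name="us-east-1")

# The bucket and key are placeholders for one of our stored photos.
response = rekognition.detect_labels(
    Image={"S3Object": {"Bucket": "unsplash-photos", "Name": "photo.jpg"}},
    MaxLabels=20,
    MinConfidence=80,  # keep only conservative, high-confidence keywords
)

for label in response["Labels"]:
    print(label["Name"], label["Confidence"])  # e.g. "Mountain 96.4"
```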

Google Geocoding API

The Geocoding API from Google helps us enforce the format of the locations users enter for their accounts or their photos. When you send it an unformatted location like “NYC”, it returns a formatted version, splitting out the city name, county, state, country, etc.

We store that formatted version to be able to query our data based on these different parameters.
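A sketch of that call (this is the real Geocoding endpoint; the API key is a placeholder):

```python
import requests

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

# Turn a free-form location like "NYC" into a structured, formatted one.
resp = requests.get(GEOCODE_URL, params={
    "address": "NYC",
    "key": "<api-key>",  # placeholder
}).json()

best = resp["results"][0]
print(best["formatted_address"])  # "New York, NY, USA"
# address_components splits out the city, county, state, country, etc.
for component in best["address_components"]:
    print(component["types"][0], "->", component["long_name"])
```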

Data Visualization

Photo by William Iven on Unsplash

Our main visualization tool is Looker. Luke already wrote about it:
How Looker saved our fundraising process

Looker

More than a visualization tool, Looker allows anyone in the company to explore the data available in our different data stores and create dashboards.

This exploration is automated and managed by a set of data models that you can generate or write yourself. These models can be persisted in your warehouse for faster retrieval of the data.

You can control the limits of the exploration your teammates can do. Controlling how people access the data and what they do with it is key: not because you want to censor things, but because you want to be absolutely sure they can’t make mistakes.

Mistakes create uncertainty, which leads people to stop looking at the data out of fear of making decisions based on wrong numbers.

The models you build can also clarify the data for your teammates. For example, your model can interpret a set of integer statuses (1, 2, 3, 4) as readable statuses (submitted, review, accepted, deleted).

Looker is plugged into our different data stores, but mainly into our Redshift warehouse, where all our product events end up and where our calculations and enrichments are backed up daily.

Data Analysis

Photo by Franki Chamaki on Unsplash

We do most of our analysis with Looker, but sometimes it’s not enough. When we need to dig deeper or run proper research projects, we create data notebooks.

Google Colaboratory

If you’re a data analyst, you probably know what a data notebook is. For everyone else: it’s a document that combines code, data visualizations and textual explanations to show, step by step, how a specific piece of research was conducted.

Google Colaboratory is an engine that lets you collaborate on a data notebook. If you work in a team on a single research project, you can all edit the same notebook without having to synchronize your code. Collaborative code editing is tricky, but Colaboratory is also a nice and easy way to share your notebooks with people without requiring them to run a notebook engine.

The Unsplash Lab

Data engineering and analysis aren’t only for business intelligence or stats. Data can also be used to create new features based on user preferences, similar-content suggestions, learning patterns, etc.

This type of research is also very present at Unsplash. To make sure each project ends up being something concrete that the team can try and play with, we built a small UI called Unsplash Labs. It showcases all the different prototypes we build when researching new product features.

Unsplash Labs homepage: Lists the most recent research projects

The Lab not only shows each prototype but also links it to its code on GitHub, its cards in Trello and interesting related documents we found while researching. That way, anyone on the team can test these prototypes, learn more about them and give feedback.

Research project details: Prototype to play with + related code and documents

We’re thinking about opening our Lab to the public. It would require some work, but it might very well happen at some point…

And just like that… the data stack tour is over! If you have any questions, feel free to ask in the comments! We’ll be happy to answer as soon as possible.

Thanks for reading!
