The data stack at Unsplash
Here at Unsplash, none of our data processes are built into any of the main products (website, API, etc.). They’re all gathered in a dedicated project that we call Unsplash-data. Like any other product, Unsplash-data uses various third-party services to manage, enrich, store, visualize and distribute data.
Let me take you on a tour of our data stack.
First of all, we need to collect data from all our different applications.
Snowplow is an open-source data pipeline that offers different solutions to collect data from different platforms.
It’s ideal for Unsplash because we have a lot of very different products. It includes all the tools to set up a pipeline (either batch or real-time) that collects, enriches and stores your data, all inside your own infrastructure.
Our Snowplow pipeline is running on Amazon AWS. It contains:
- An Elastic Beanstalk web server that receives events and produces logs
- A set of S3 buckets to store the web server logs and back up the different stages of our data processing
- Elastic MapReduce clusters that are spun up to process, enrich and format the logs we collected
- A Redshift warehouse to store the processed data (accessed via SQL)
- An EC2 box that schedules the data processing and also works as a proxy
Thanks to Snowplow, we completely own our data. We’re free to modify, archive, derive or replay it exactly the way we want.
The data processed by Snowplow ends up in our main data warehouse, a Redshift cluster.
Redshift is a warehouse that you can query with SQL (a limited version of PostgreSQL with fewer or different functions). Read and write operations are free; you pay only for the storage volume and computing power you want available. It’s an easily scalable solution and, like most warehouses, it uses columnar storage. In short, this means it can scan millions or billions of rows very quickly as long as your query reads only a few columns.
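To make the columnar idea concrete, here is a toy sketch in plain Python. It is not how Redshift is implemented, and the table and column names (`photo_id`, `views`, `downloads`, `country`) are made up for the example. It only shows why reading a single column is cheaper when the values of each column are stored together.

```python
from typing import Dict, List

# Row-oriented layout: each record keeps all of its fields together.
# The data below is invented for illustration.
ROWS: List[dict] = [
    {"photo_id": 1, "views": 120, "downloads": 8, "country": "CA"},
    {"photo_id": 2, "views": 45, "downloads": 2, "country": "FR"},
    {"photo_id": 3, "views": 300, "downloads": 31, "country": "US"},
]


def to_columns(rows: List[dict]) -> Dict[str, list]:
    """Pivot row records into one list per column (a toy columnar layout)."""
    columns: Dict[str, list] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_views_row_store(rows: List[dict]) -> int:
    # Every full record is visited even though only one field is needed.
    return sum(row["views"] for row in rows)


def total_views_column_store(columns: Dict[str, list]) -> int:
    # Only the "views" column is read; the other columns are never touched.
    return sum(columns["views"])
```

Both functions return the same total. The difference is how much data has to be read to get there, which is what matters once the table holds billions of rows.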
The ability to query it via SQL makes it an awesome solution when it comes to developing around it and integrating it in our other systems or third party services.
We also have other data stores for other usages.
BigQuery is another warehouse we use. We use it more as an intermediate store than as final data storage. It’s a destination for logs but also a data source for the rest of the pipeline. We read from it, run our calculations and store the results elsewhere. It’s where our own API logs and all the photo views end up before being processed.
What’s interesting about it is that the computing power is dynamic and handled by the system itself. You don’t have to worry about whether the BigQuery cluster is going to be powerful enough: it will be. The price you pay is based on the volume of data that you read and write. Storage itself is cheaper ($0.01–0.02 per GB). The pricing model is the exact opposite of Redshift’s.
So depending on our use case, we might use Redshift over BigQuery or the other way around to lower costs.
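As a rough illustration of that trade-off, the sketch below compares a pay-per-volume model with a provisioned one. The function names and every number passed to them are placeholders chosen for the example, not real BigQuery or Redshift prices, so check the current pricing pages before relying on anything like this.

```python
def on_demand_cost(tb_processed: float, price_per_tb: float) -> float:
    """Pay-per-volume model: the bill grows with the data you process."""
    return tb_processed * price_per_tb


def provisioned_cost(nodes: int, hours: float, price_per_node_hour: float) -> float:
    """Provisioned model: the bill depends on cluster size and uptime only."""
    return nodes * hours * price_per_node_hour


def cheaper_model(
    tb_processed: float,
    price_per_tb: float,
    nodes: int,
    hours: float,
    price_per_node_hour: float,
) -> str:
    """Return which of the two models costs less for a given workload."""
    usage_based = on_demand_cost(tb_processed, price_per_tb)
    fixed = provisioned_cost(nodes, hours, price_per_node_hour)
    return "on-demand" if usage_based < fixed else "provisioned"
```

With placeholder prices, a job that touches little data favours the pay-per-volume model, while a workload that scans a lot of data all month favours the provisioned cluster.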
Another data source for the Unsplash-data system is simply the product database, hosted on Heroku, where all the photos, user accounts, collections, API applications, etc. are stored. The data system reads from it to keep an up-to-date copy of certain data models within itself, but also to run calculations or transform the data into a more readable format.
The data system has its own PostgreSQL database that is also hosted on Heroku. Everything data-related is stored in there. Whenever the API needs to access data like views or downloads, it will query this database.
Once all the data sources are plugged in and the data is available in our PostgreSQL database, we use different services to enrich the data.
Enriching the data means collecting more data purely based on the data you already possess. It’s like deducing new data.
For example, you can generate tags for a photo by using an AI service and then store the generated tags with your photo. You “enriched” your photo with tags.
Google Vision offers all kinds of detection on your photos. We use it to extract colours, detect landmarks and evaluate the content to understand whether it’s NSFW or violent.
A nice example is that we used the enriched data from Google Vision to build Unsplash Landmarks.
Cloudsight offers a nice feature called “Whole scene description”. It’s an API that analyses the content of your photo and formulates a sentence describing it. We grab and store that formulation, mainly for accessibility and to provide an alternative text if the image has a loading problem.
Rekognition works a bit like Google Vision, except that we only use it to generate tags and keywords. Rekognition analyses the content of the photo and provides a set of keywords that describe it, each with a confidence level. This is one of the first building blocks of our search engine.
We picked Amazon Rekognition over its competitors because we found that its confidence level is well calibrated and the list of keywords offers a solid and conservative base on which we can safely build an accurate search engine.
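Here is a minimal sketch of how such a response could be turned into search keywords. It assumes a response shaped like the output of Rekognition's label detection (a `Labels` list whose items carry `Name` and `Confidence`). The function name, the 90% threshold and the sample data are hypothetical, not what we run in production.

```python
from typing import List


def keywords_from_labels(response: dict, min_confidence: float = 90.0) -> List[str]:
    """Keep only the labels we trust, most confident first, as lowercase keywords."""
    labels = response.get("Labels", [])
    confident = [label for label in labels if label["Confidence"] >= min_confidence]
    confident.sort(key=lambda label: label["Confidence"], reverse=True)
    return [label["Name"].lower() for label in confident]


# Hand-written sample in the shape of a label-detection response.
SAMPLE_LABELS = {
    "Labels": [
        {"Name": "Lake", "Confidence": 93.1},
        {"Name": "Mountain", "Confidence": 98.7},
        {"Name": "Boat", "Confidence": 61.4},
    ]
}
```

Being conservative with the threshold means fewer keywords per photo, but the ones that remain are much less likely to pollute search results.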
The Geocoding API from Google helps us enforce the format of the locations that users enter for their account or their photos. When we send an unformatted location like “NYC” to the Geocoding API, it returns a formatted version, split into city name, county, state, country, etc.
We store that formatted version to be able to query our data based on these different parameters.
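The sketch below shows one way to split such a response into separate fields. It relies on the `address_components` list and the component types (`locality`, `administrative_area_level_1`, `administrative_area_level_2`, `country`) that the Geocoding API returns. The function name, the field names and the sample response are assumptions made for the example; the sample is hand-written and abbreviated, not captured output.

```python
from typing import Dict, Optional

# Maps the fields we want to store to Geocoding API component types.
COMPONENT_TYPES = {
    "city": "locality",
    "county": "administrative_area_level_2",
    "state": "administrative_area_level_1",
    "country": "country",
}


def split_location(response: dict) -> Dict[str, Optional[str]]:
    """Pull city, county, state and country out of a geocoding response."""
    location: Dict[str, Optional[str]] = {field: None for field in COMPONENT_TYPES}
    results = response.get("results", [])
    if response.get("status") != "OK" or not results:
        return location  # nothing usable: every field stays None

    # Use the first match returned by the API.
    for component in results[0].get("address_components", []):
        for field, component_type in COMPONENT_TYPES.items():
            if component_type in component.get("types", []):
                location[field] = component["long_name"]
    return location


# Hand-written, abbreviated sample in the shape of a Geocoding API response.
SAMPLE_RESPONSE = {
    "status": "OK",
    "results": [
        {
            "formatted_address": "New York, NY, USA",
            "address_components": [
                {"long_name": "New York", "short_name": "New York",
                 "types": ["locality", "political"]},
                {"long_name": "New York", "short_name": "NY",
                 "types": ["administrative_area_level_1", "political"]},
                {"long_name": "United States", "short_name": "US",
                 "types": ["country", "political"]},
            ],
        }
    ],
}
```

Fields that the response does not contain simply stay empty, so a location with no county still produces a consistent record.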
More than a visualization tool, Looker allows anyone in the company to explore the data available in our different data stores and create dashboards.
This exploration is automated and managed by a set of data models that you can generate or write yourself. These models can be persisted in your warehouse for faster retrieval of the data.
You can control the limits of the exploration that your teammates can do. Controlling how people access the data and what they do with it is key. Not because you want to censor things, but because you want to be absolutely sure that they can’t make mistakes.
Mistakes create uncertainty, and uncertainty leads people to stop looking at the data for fear of making decisions based on wrong numbers.
The models you build can also clarify the data for your teammates. For example, your model can interpret a set of integer statuses (1, 2, 3, 4) as readable statuses (submitted, review, accepted, deleted).
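In Looker this kind of mapping lives in the model itself. The Python sketch below only illustrates the idea, with a hypothetical `PhotoStatus` enum that follows the order given above.

```python
from enum import IntEnum


class PhotoStatus(IntEnum):
    """Readable names for the integer statuses stored in the database."""
    SUBMITTED = 1
    REVIEW = 2
    ACCEPTED = 3
    DELETED = 4


def readable_status(code: int) -> str:
    """Translate a raw status code into the label teammates see."""
    try:
        return PhotoStatus(code).name.lower()
    except ValueError:
        # Unknown codes are surfaced explicitly instead of being guessed.
        return "unknown"
```

Anyone exploring the data then filters on “accepted” rather than having to remember that it is stored as 3.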
Looker is plugged into our different data stores, but mainly into our Redshift warehouse, where all our product events end up and where our calculations and enrichments are backed up daily.
We do most of our analysis with Looker, but sometimes it’s not enough. When we need to dig deeper or run real research projects, we create data notebooks.
If you’re a data analyst you probably know what a data notebook is. For everyone else, it’s a document that combines code, data visualizations and text to show and explain, step by step, how a specific piece of research was conducted.
Google Colaboratory is an engine that allows you to collaborate on a data notebook. So if you work in a team on a single research project, you can edit the same notebook without having to synchronize your code. Collaborative code editing is tricky, but Colaboratory is also a nice and easy way to share your notebooks with people without them having to run a notebook engine.
The Unsplash Lab
Data engineering and analysis are not only for business intelligence or stats. Data can also be used to create new features based on user preferences, similar-content suggestions, learned patterns, etc.
This type of research is also very present at Unsplash. To make sure each project ends up being something concrete that the team can try and play with, we built a small UI called Unsplash Labs. Unsplash Labs showcases all the different prototypes that we build when researching new product features.
The lab not only shows the prototype but also links it to code on GitHub, cards in Trello and interesting related documents that we found on the internet while researching. That way, anyone in the team can test these prototypes, learn more about them and give feedback.
We’re thinking about opening our Lab to the public. It would require some work, but it might very well happen at some point…
And just like that… the data stack tour is over! If you have any questions, feel free to ask in the comments! We’ll be happy to answer as soon as possible.
Thanks for reading!