Using realistic data locally

Anson Kelly
Carwow Product, Design & Engineering
3 min read · Sep 7, 2017

When developing features it is handy to be able to test with a “real” dataset and not just a bunch of randomly generated data.

This means there are fewer surprises when the feature gets deployed, both in terms of scalability and content.

An example that we ran into recently at carwow was adding a last page option to a paginated list. It doesn’t sound like a large task but, like most features, the devil is in the detail. If your local development environment only has a few hundred items then calculating how many pages of items there are is quick. In production, however, there could be millions of items which gives a different performance profile.

Sidenote: running the equivalent of "SELECT count(*) FROM table_name" scales very badly, but that's another topic entirely.
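As a quick aside (a sketch only, not necessarily what we run in production): Postgres can return a cheap row estimate from the planner's statistics instead of an exact count, which is often good enough for pagination. The table name here is a placeholder.

-- Exact count: scans the whole table (or an index), slow on millions of rows
SELECT count(*) FROM table_name;

-- Approximate count: reads the planner statistics instead, effectively instant
SELECT reltuples::bigint AS estimated_rows FROM pg_class WHERE relname = 'table_name';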

Another example is the content: long names can overflow areas of the page, and testing with a realistic dataset means fewer surprises like these.

Lite Backups to the Rescue

So that developers can work with a dataset that is close to production, we frequently restore backups of production data locally. Luckily for us, Heroku makes it easy to generate regular backups.
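For example, scheduled backups of a database can be switched on with a single CLI command (the app name and time below are placeholders):

# Take an automatic backup of the primary database every night
heroku pg:backups:schedule DATABASE_URL --at "02:00 Europe/London" --app our-app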

There is a problem with doing this though: now that carwow has been going for over four years, our production dataset has grown. Our primary databases are well over 100GB in size, which is a bit much to be restoring locally. So we use a compromise: we prune the dataset down to a more manageable size, while keeping the content consistent and the dataset large enough that most (if not all) performance-related issues still make themselves known.

We have found that pruning our dataset to the last 30 days of user activity strikes a reasonable balance between keeping the lite backup size manageable and maintaining a good mix of real data.
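"Consistent" here means the pruning has to follow the foreign keys, so that the rows which survive still reference each other. A purely illustrative sketch, with made-up table names (the real scripts appear further down):

-- Drop users created more than 30 days ago...
DELETE FROM users WHERE created_at < current_date - interval '30 days';
-- ...then anything that referenced them, so no orphaned rows remain
DELETE FROM quotes q WHERE NOT EXISTS (SELECT 1 FROM users u WHERE u.id = q.user_id);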

Stop talking and show me the code

To do this we have a basic Heroku application that runs a Docker container. All we need is the Heroku CLI and a Postgres client. Here's the Dockerfile:

FROM debian:jessie

# Install the Heroku CLI and the Postgres client, then clean up to keep the image small
RUN apt-get update && apt-get install -y --no-install-recommends git wget postgresql-client ca-certificates \
&& wget http://cli-assets.heroku.com/branches/stable/heroku-linux-amd64.tar.gz \
&& tar -xzf heroku-linux-amd64.tar.gz -C /usr/local/lib/ \
&& ln -s /usr/local/lib/heroku/bin/heroku /usr/local/bin/heroku \
&& rm heroku-linux-amd64.tar.gz \
&& apt-get purge -y wget \
&& apt-get autoremove -y \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

RUN mkdir /app
WORKDIR /app

# Copy in the scripts that build each lite backup
COPY generate_quotes_lite /app/
COPY generate_dealers_lite /app/
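For reference, the image can be built and exercised locally with plain Docker (the image name, app name and attachment name below are illustrative; on Heroku the environment comes from the app's config vars):

docker build -t lite-backups .
docker run --rm -e HEROKU_API_KEY=$HEROKU_API_KEY -e APP=our-app -e DB=QUOTES_LITE_FORK lite-backups bash ./generate_quotes_lite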

We generate a new lite backup every night. Using the Heroku Scheduler makes running daily tasks like this easy.
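Each Scheduler entry is just one of the scripts copied into the image above, run once a day, roughly:

./generate_quotes_lite
./generate_dealers_lite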

In each generate_*_lite script we use the Heroku CLI to fork the database, prune the dataset and then capture a backup of the smaller dataset. Here is an (abridged) version:

#!/bin/bash
set -e
# Copy the database (fork)
heroku addons:create heroku-postgresql:standard-2 --fork $APP::DATABASE_URL --as $DB --fast --app $APP --region eu

until heroku pg:wait --app=$APP; do
  echo " Waiting for DB fork..."
done

DB_URI=`heroku pg:credentials:url $DB --app=$APP | grep "postgres://"`
# Prune the data
psql $DB_URI -c "DELETE FROM users WHERE created_at < current_date - interval '1 month'"

# ... other psql statements follow to prune data ...
# Capture the backup
heroku pg:backups capture $DB --app=$APP
# Delete the forked DB
heroku addons:destroy $DB --app=$APP --confirm $APP

Those backups are then available for developers to download and use locally.
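Pulling one down follows the standard Heroku backup workflow; a typical session might look like this (the backup ID, app name and local database name are placeholders):

# Download the lite backup
curl -o lite.dump "$(heroku pg:backups:url b123 --app our-app)"

# Restore it into a local development database
pg_restore --verbose --clean --no-acl --no-owner -d carwow_development lite.dump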

Because a fresh set of lite backups is generated each night, the data developers restore is never far behind production.

This is nothing too complicated, but it's a nice example of how a bit of automation can make a developer's life easier.

Like what you see? Want to be a part of the carwow tech community?

Head to our jobs page now.
