How to build a lean data warehouse and BI infrastructure
Like most early companies we started to manually keep track of our numbers in google sheets. Obviously, this approach is hardly scalable and massively prone to human error.
Albeit the right solution in the early days of a company — the challenge is to grow the system week by week into a more robust business information system.
But where to start?
In any case you need to set-up a centralised data warehouse first, which feels like a comprehensive project to implement. I believe most companies are outright scared by the extent and efforts it takes to setup and maintain their BI system. It seems that sticking to google sheets is often the cheaper and leaner approach. This post gives a short introduction to a lean data warehouse and visualisation approach.
How to escape the google sheets massacre
Part I: You don’t want to write code
- Setting-up a cloud-based reporting database. I suggest to use PostgresQL or MySQL. AWS has free tiers for small business and an easy setup approach (https://aws.amazon.com/rds/postgresql/)
- Accessing your database and designing your database structure. I suggest to use pgAdmin (or similar free tools for MySQL) to access your database locally and set-up your reporting tables. Your initial google sheets or excel reporting structure can serve as a good glimpse on how to set-up your table structure.
- Setting-up a layer to retrieve and store data in Postgres. In a first step you can already achieve a lot using Zapier without coding. Zapier allows you to connect to any webhook / API and push data into your database routinely without coding.
With this set-up you can already aggregate a lot of external data and save it in your data base (or namely data warehouse, which is nothing more than a data repository). Some examples:
- Saving data from google sheets in your database
- Pushing data from any API through Zapier into your database
- Setting-up daily or hourly routines in Zapier to retrieve data from one of the tools you are using (webshop, Trello, email, Google etc) and pushing it into your database
- Zapier already provides a ton of apps by default. Just have a look here
How to escape the google sheets massacre
Part II: Zapier is not doing the job
At some point Zapier might not be sufficient anymore to serve as a layer between your business data and your data warehouse. In that case you need a more elaborate piece of software working as the middleware between all your data and your warehouse.
We have become a huge fan of python and their data manipulation libraries such as Pandas. If you are using python exclusively for BI purposes, I suggest to use a service like pythonanywhere to avoid code deployment or other set-up headaches. Alternatively, Google app engine comes with a lot of specific and handy google APIs built into the service.
Pythonanywhere provides you with a virtual environment and console to host and execute python scripts as well as to set-up routines (e.g. pulling and pushing data daily)
The most important libraries to get a head start with python are:
- sqlalchemy: Query databases and push data into SQL databases
- requests: Make Api requests
- pandas: Manipulate all sorts of data
- gspread: Connect to google sheets and google APIs
To pull the data from the various services you may be using, you will need to make API calls to them (if they provide an API service). The methods of authentication for API services vary and are sometimes complex, but python normally has a library with decent documentation for you to work through. Your most frequent ‘go-to’ library would be ‘requests’, but we have also used ‘sha’ and ‘hashlib’ to deal with APIs that require unique hash signatures on every call.
Certain APIs sometimes return different paginated responses, but normally the methods for using them are explained in the API documentation. After the API call to retrieve your data, you’ll normally receive a JSON file, which you can read using the ‘json’ library in python and can manipulate how you like with the rest of your code.
.. so how to visualise and manipulate your data?
Having a data warehouse and aggregating data from different sources is nice, but still cumbersome to work with. There is a ton of data visualisation and BI tools on the market. The major difference lies between simple dashboard tools visualising data or more elaborate tools, allowing also colleagues without SQL knowledge to query the data base and build customised reports and graphs.
We had a look at holistics, tableau, chartio and looker. Holistics and Chartio both provide a free test-version you can test. Unfortunately, some of the tools are getting a bit pricy for startups as soon as you add more seats. (chartio 199$ per month / seat).
The beauty with most tools is that they can be easily connected with your data base. So once you have a well structured data base, which is routinely fed with all relevant business data, you can easily manipulate data, build dashboards or schedule reports.
The entire set-up doesn’t take long. It takes you 30 minutes to set-up Postgresql on AWS. Another 30 minutes to install a local database access such as pgAdmin. The most expensive part is setting-up the database structure and Zaps to populate your tables. As mentioned before, the best way is probably to start iteratively by moving different data sources step by step into postgresql tables. The most important part being: you don’t need to be extraordinarily python literate to start building your own data warehouse.
AWS offers free tiers for postgresql, in any case it should’t be more than €50 per month. Pythonanywhere €12 per month. The most expensive part is Chartio, which starts starts at €199 per month and seat.
If you are in Berlin try our food: greengurus.de | Thanks to Jamie Burton for his input!