Building a Data Analytics Infrastructure in 2019

Euan Richard
ParkBee
12 min read · Apr 30, 2019

How to easily create from scratch a modern, reliable pipeline for data gathering, storage, post-processing, and visualisation, in a growing young company.

In other words, is it possible to reliably and professionally do ETL, Data Warehousing, and Business Intelligence on a limited budget and with limited personnel? As the Senior Data Scientist at ParkBee, having joined at an early stage of the company, I found that through the use of several SaaS tools it's indeed possible to build and deploy such an infrastructure company-wide, without needing to hire any in-house engineers or a large team of data analysts.

Building such an infrastructure is not traditionally part of the Data Scientist role, and I spent a fair amount of time researching and field-testing the various options; this article is aimed at any data analyst, scientist, or general tech-y employee in a similar position in a young company that wants to become “data-driven”, but doesn't have a large analytics team in the traditional sense. In the end, by making full use of modern technologies & services, I show one way of quickly building a reliable & very low-maintenance data infrastructure, opening up data access to all employees and allowing them to focus their expertise on using the data — rather than spending time worrying about how to collect it.

This article mentions several commercial services, but contains solely the personal opinions of the author, and no agreements involving payment or otherwise were in place that affected the writing of this article; I try to be as balanced as possible. As a Scientist rather than an Engineer or a Developer, I’d also be very happy to hear comments from people who have taken a different approach!

Data Warehousing

Data!

Let me begin from the middle of the ETL → Warehouse → Visualisation pipeline. It's necessary to have a centralised location to store data, capable of combining data from various sources and handling large-volume queries against it; generally this is referred to as the data warehouse. Ideally it should contain all relevant data and be kept up-to-date, so that there's always exactly one correct place to look for data for analytics purposes (the “one-truth” philosophy). This is usually an SQL database. Most tech companies have at least one SQL database holding information used by their product; however, it's best to separate your warehouse from your production database, as large queries can slow down performance (and you don't want your product to go offline every time someone internal looks at a graph).

An SQL database essentially contains several tables that may be related to each other, and can be structured or indexed to improve access speed. There are many other ways to store data, such as NoSQL (non-relational databases), JSON (a format for lists and dictionaries), or CSV (simple raw-text, comma-separated tables), but in many cases it's possible to convert and store this information in an SQL database anyway (even if, strictly speaking, that's sub-optimal). Traditionally, such databases are maintained & optimised by a Database Administrator (DBA) — but since we're on a budget, let's assume we don't have one of those, and continue…
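As a deliberately simple illustration of the “just put it in SQL anyway” idea, here's a minimal Python sketch that loads a raw CSV export into a warehouse table using pandas and SQLAlchemy. The connection string, file name, and table names are hypothetical placeholders, and I'm using Postgres here purely as an example (more on choosing a dialect below).

```python
# A minimal sketch: load a raw CSV export into a Postgres warehouse table.
# The connection string, file name, and table names are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://analytics_user:secret@warehouse-host:5432/warehouse")

# pandas infers column types, which is usually good enough for a first load;
# types and indexes can always be refined inside the warehouse later.
df = pd.read_csv("daily_export.csv")

# Append into a staging table, creating it if it doesn't exist yet.
df.to_sql("raw_daily_export", engine, schema="staging", if_exists="append", index=False)
```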

First we should decide which SQL dialect to use. Unfortunately there are many different types & flavours of SQL, and at ParkBee we actually use two; a PostgreSQL database hosted on AWS (Amazon’s cloud computing service) for most general data warehousing, and a Snowflake database for a high-volume (billions of rows) collection of one particular type of record that we store. We needed this duality because of the specific high-volume data source, but for most companies doing basic internal analytics this might not be necessary, and a single flavour may be preferable.

In fact, I’d definitely recommend Postgres in general over the many competing dialects, because it’s both technically more advanced and cheaper to run. In particular, it supports functions such as automatic series generation, direct JSON storage and access, and so on — which you might not need at first, but may find handy later. Companies have also been known to save 80% in database costs switching from e.g. MS SQL Server to Postgres. Like most dialects, it stores data row-by-row, which makes it particularly fast at handling smaller datasets interconnected by complicated joins (although in this case, we’re still talking about say a million rows and several joins before any delay becomes noticeable). A Postgres database can be easily set up on AWS, and accessed through a desktop client such as the awkward-but-powerful, dialect-agnostic DBeaver (or many others; take your pick). There are of course plenty of tutorials that can be followed for things like creating a database and some tables in it — but since we’ll mostly be importing data from other sources, we probably don’t even have to worry about that yet!
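To make those Postgres-specific features a little more concrete, here's a rough sketch (via psycopg2) of the kind of queries I mean: one using generate_series to build a gap-free daily report, and one pulling a field straight out of a JSON column. The host, credentials, and table/column names are hypothetical.

```python
# A small sketch of the Postgres features mentioned above, run via psycopg2.
# Host, credentials, and table/column names are hypothetical.
import psycopg2

conn = psycopg2.connect(host="warehouse-host", dbname="warehouse",
                        user="analytics_user", password="secret")
with conn, conn.cursor() as cur:
    # generate_series: build a complete list of days, so gaps in the data
    # still show up as rows (with zero counts) in a daily report.
    cur.execute("""
        SELECT d::date AS day, count(b.id) AS bookings
        FROM generate_series('2019-01-01'::date, '2019-01-31'::date,
                             interval '1 day') AS d
        LEFT JOIN bookings b ON b.created_at::date = d::date
        GROUP BY 1 ORDER BY 1;
    """)
    print(cur.fetchall())

    # JSON access: pull a nested field straight out of a JSONB column.
    cur.execute("SELECT payload->>'vehicle_type' FROM raw_events LIMIT 5;")
    print(cur.fetchall())
```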

Snowflake, on the other hand, stores data column-by-column, which makes it much faster when performing simpler aggregations over huge amounts of data (for example, taking the average of one field over a billion rows can still complete within a split second). Competitors to Snowflake include Redshift and BigQuery; since they have various different optimisations under the hood, we simply started up some trial databases, put some of our data in there, and tested which had the best query times on our particular data structure. Snowflake is also nice due to its self-scaling, elastic nature; it only gives you (& charges you for) the processing power & storage that you need, scaling up and down literally within seconds, so if your usage is sporadic it can save a lot on operating costs. Finally, though Snowflake has its own SQL dialect, it's fairly similar to Postgres; while not quite as advanced, it does offer some potentially nice extra features, like the ability to schedule & run data post-processing in SQL internally within the database itself (we don't use this at ParkBee, however — personally I feel that SQL is not always the best language for complicated data post-processing, and in any case we have a different tool for SQL post-processing, as will be explained later).
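For contrast, this is the kind of query a columnar warehouse like Snowflake is built for: a simple aggregate scanned over one column of a very large table. The sketch below assumes the snowflake-connector-python package; the account, credentials, and table names are hypothetical.

```python
# A sketch of the kind of query a columnar warehouse handles well:
# a simple aggregate over one column of a very large table.
# Account, credentials, and table names are hypothetical.
import snowflake.connector

conn = snowflake.connector.connect(
    account="xy12345.eu-west-1", user="analytics_user", password="secret",
    warehouse="ANALYTICS_WH", database="WAREHOUSE", schema="PUBLIC",
)
cur = conn.cursor()
# Scanning a single column across billions of rows is where columnar
# storage shines; the equivalent row-store query would be far slower.
cur.execute("SELECT AVG(duration_minutes) FROM parking_events;")
print(cur.fetchone())
cur.close()
conn.close()
```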

So, assuming we’ve set up a database now, what next?

ETL — Extract, Transform, Load

Loading stuff!

The next question is — how do we get our data into the warehouse in the first place? Traditionally, you hire a bunch of Data Engineers who spend hours writing and maintaining complex ETL code: to Extract data from a given source (e.g. another database, APIs, local files, websites…), Transform it into whatever shape is required or desired by the warehouse, and finally Load it in to the warehouse.

In some cases the Transform stage includes a lot of data pre-processing; for example, simplifying or aggregating data, or performing other calculations on it. This was especially true in the past, but at ParkBee we do essentially no transformations at the ETL stage. Bandwidth and data storage are now so cheap that there's little reason to aggregate at this point; why not simply store everything? You never know when it might be useful, especially when your analysis goals and the end data that you want are very likely to change in the future.
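For the curious, here's roughly what a hand-rolled “extract & load, skip the transform” job could look like in Python: pulling raw records from a (hypothetical) JSON API and dumping them untouched into a JSONB column, leaving any reshaping for later inside the warehouse. It's also exactly the kind of plumbing that the services mentioned next can take off your hands.

```python
# A minimal hand-rolled "extract & load, skip the transform" sketch.
# The API URL, table, and credentials are hypothetical.
import json
import psycopg2
import requests

# Extract: pull raw records from some JSON API.
records = requests.get("https://api.example.com/v1/events?since=2019-04-01").json()

# Load: store the records as-is in a JSONB column; any aggregation or
# reshaping can happen later, inside the warehouse.
conn = psycopg2.connect(host="warehouse-host", dbname="warehouse",
                        user="analytics_user", password="secret")
with conn, conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO raw_events (payload) VALUES (%s::jsonb);",
        [(json.dumps(r),) for r in records],
    )
```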

In any case, it’s even debatable whether engineers should or shouldn’t write ETL at all, especially when data is being pulled from a common source, such as Google Analytics or another SQL database. It turns out that there are several companies which can do the job for you, such as Stitch, Fivetran, or Etleap.

At ParkBee, we chose Fivetran; without going into too much detail, the pricing was in the mid-range among competitors (including significant discounts for start-ups), partly because they don't provide unnecessary extra transform functionality, and the interface was incredibly simple and easy to use.

We have a co-marketing study with Fivetran which successfully runs through the main buzzwords, but to cut a long story short: within a few minutes of creating a trial account, we were able to connect to our backend MS SQL Server database, copy the data into the warehouse, and set it to update new data every 5 minutes. Within a few more hours we were able to bring in data from Facebook, Google Ads, Google Analytics, LinkedIn, and SalesForce. In combination with the raw data from the product backend, these datasets almost instantly provide a real goldmine for e.g. investigative analysis into the correlations between marketing, sales, web traffic, and product usage, automated inter-departmental reporting, and so on. Over time, we even added some Google Sheets integrations; to be fair this is a bit of a messy solution, but if carefully managed it works well in some specific cases (depending on priority, business case, time to create a proper backend feature, etc.).

Unfortunately, in the interest of full transparency, we did run into some trouble with some of the lesser-known connectors offered by Fivetran, namely DynamoDB and MongoDB. While their support was generally very responsive, some issues couldn't be solved; it seems clear that the company prioritises its core set of more common connectors. Of course, the general disadvantage of an ETL company is that at some point you're probably going to want to bring in data from a proprietary API or some kind of custom service; only the much higher-end providers are able to help with this.

Nonetheless, with Fivetran, at a fraction of the salary of a single Data Engineer, we ended up with a fully-maintained, real-time ETL stream, and managed to fully stock our database within a few hours.

BI — Business Intelligence

Business!

There's all kinds of general information out there on how to approach BI (i.e. analytics visualisation) and BA (Business Analytics, the same thing but more predictive), so here let me only briefly discuss how our BI solution fits in with the other SaaS tools described so far.

We considered both PowerBI and Tableau, but eventually chose Looker as our BI tool, for the following key reasons.

  1. The ability to define a single, fixed set of metrics, by combining any data in the warehouse using inbuilt SQL post-processing (real-time or cached).
  2. The ability to make these metrics available to users in a GUI, and allow them to build their own visualisations and dashboards.
  3. A fully-featured REST API with a Python wrapper, allowing e.g. Machine Learning applications to also make quick and easy use of these metrics.

The above three points sound pretty fantastic. Point 1 is great as the post-processing ability combines well with our previously-stated “load everything” ETL philosophy, and with the “one-truth” metrics philosophy whereby we attempt to maintain one single, correct source for all reporting. Point 2 is great as it seems to enable our entire company to be “data-driven”, despite employing only two Data Scientists and one Data Analyst, by making this metric store fully interactive for all users (Looker calls points 1 & 2 together the “third wave” of BI). Point 3 is particularly exciting for myself and the other Data Scientists, who hoped that establishing this infrastructure would make our analyses and deployed models much easier to build.
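As a rough illustration of point 3, this is what pulling a “one-truth” metric out of Looker can look like from Python, using plain requests against the REST API rather than the official wrapper. The host, API credentials, and look ID are hypothetical, and the endpoint paths assume the Looker 3.x API.

```python
# A rough sketch of pulling a saved Look's results out of Looker over its
# REST API with plain requests. Host, credentials, and the look ID are
# hypothetical; endpoint paths assume the Looker 3.x API.
import requests

BASE = "https://yourcompany.looker.com:19999/api/3.1"

# Authenticate with API credentials to get a short-lived access token.
token = requests.post(
    f"{BASE}/login",
    params={"client_id": "YOUR_CLIENT_ID", "client_secret": "YOUR_CLIENT_SECRET"},
).json()["access_token"]

# Run a saved Look and fetch its results as JSON, e.g. as input features
# for a machine-learning model.
rows = requests.get(
    f"{BASE}/looks/42/run/json",
    headers={"Authorization": f"token {token}"},
).json()
print(rows[:3])
```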

It almost seems from the above points that by adding Looker on the end, we can easily finish building our infrastructure right away. However, don’t get me wrong; while I really do recommend Looker, it’s not a perfect one-size-fits-all solution, and the above is easier said than implemented. In fact, this is the most time-consuming part of building the infrastructure, and there are a few hangups that have to be considered.

  1. Looker is entirely SQL-based under the hood. SQL is a powerful query language but not a programming language, so it’s incredibly cumbersome for certain classes of problems or data post-processing tasks that might otherwise have been simple.
  2. You have to learn the “LookML” scripting language, and wrap all your SQL data sources in it. It’s easy to learn and write, but does add a bit of extra hassle and maintenance requirements. Furthermore, it can be difficult to really build your metrics in a way that exposes everything that users want.
  3. The “everyone is an analyst” mentality can work, with some users more than others, but takes a lot of continuous effort and training internally. (Our Junior Data Scientist, Luc Koppens, has already written more on this topic.)
  4. The REST API is fairly slow, and the Python wrapper is not kept up-to-date, making it useful only for applications where performance isn't critical.

All in all, it takes significant time, effort, and cost to mature the platform inside a company; it also requires ongoing maintenance, attention, and user training. I’d recommend having at least one employee with data analysis and programming skills working with the platform full-time for the first 6 months of the project, and later on at least one or two days a week. Of course, on the other hand, the platform has the huge benefit of simply removing the need for a team of full-time analysts to be constantly reporting simple analyses to business.

As for pricing; this varies per company, and we can’t disclose exactly how much we pay, but suffice it to say… it’s orders of magnitude higher than its competitors Tableau and PowerBI. To be fair, this can be understood due to the unique and advanced approach, incredibly high-quality 24-hour live support, and general user-friendliness. Nonetheless, for smaller startups, it may simply be too costly compared to the alternatives. Looker also doesn’t automatically win on every single point; Tableau, for example, has a much more fully-featured visualisation system.

It’s clear that there’s a trade-off in the above two paragraphs; I didn’t forget that this article is describing how to minimise cost and headcount, and I’ve just implied that Looker doesn’t perfectly fit either of these criteria. Unfortunately there’s not (yet) a magic bullet that can open up all your internal data for exploration without either some financial cost or manual work. In the end, we consider our combined Fivetran / Looker implementation to have given us the most “bang for the buck” and been the correct choice for our use case — but I do consider some alternative choices below.

Conclusions

Reading!

I've described one possible complete ETL → Warehouse → BI pipeline that can be set up initially within days (starting with trial accounts), deployed internally within weeks, and fully fleshed-out within a few months — all for a minimal cost. It completely avoids writing and maintaining large amounts of in-house code for e.g. API access, SQL database ingestion, and data post-processing, as well as tying it all together and monitoring its status internally (e.g. DAG workflows); furthermore, it avoids hiring a team of analysts to jump in every time a new graph or visualisation is required.

I’ve also mentioned several SaaS providers other than the ones chosen by ParkBee, and your choice obviously depends on your use case. For example, you might be able to further reduce costs by doing some data pre-processing as part of a more complex ETL process, but then using a simpler BI tool that directly connects to the tables in your data warehouse (Etleap & Tableau…?). On the other hand you may find that this requires more ongoing visualisation work by one or more Data Analysts in the future. Or, if you use a database like Snowflake with internal SQL query scheduling, you might find it easy to do data processing completely in-house, and rely on a simple ETL with a free plan (Stitch…?). The choice depends on your current in-house technical abilities, the projected number and type of data consumers, and so on…

Finally, I've also tried to be clear about the disadvantages of the SaaS approach itself: connecting only to widely-used sources, doing data processing entirely in SQL, and doing BI in a web-based visualisation engine clearly have their limits. Sooner or later, you'll find yourself having to do some data collection, analysis, or visualisation in a more general-purpose tool such as Python, and having to deploy and share the results separately from the SaaS pipeline (e.g. in a custom web-app). If your company's analysis needs are limited or at least more narrowly defined, you may even find it faster and cheaper to start at this point. Beware, however, that if such a system grows, you may find yourself reinventing the wheel again.

Euan Richard is the Senior Data Scientist at ParkBee. Questions, comments, or suggestions on the article? Let us know below! If you’re interested in what we do at ParkBee, you can read one of our other articles or check out our career page.
