How to build a Data Lake in your early-stage startup

A mostly “as a Service” architecture: easy to set up, ready for scale, and on a tight budget

Flavio Altinier
Pier. Stories
10 min read · Oct 5, 2021


Photo by Jake Hills on Unsplash

When to start building your Data Lake?

Every startup has been there at some point: you have been operating for a few months, sales have started gaining traction, and it looks like your business idea might be repeatable and sustainable after all.

It is time to try scaling out your business, but when you look inside, your data landscape is a complete mess:

  • Your whole operation depends on Google Spreadsheets, and there are multiple spreadsheets scattered around with important data that does not match what is in the back-end (here at Pier, at the beginning we actually used Google Spreadsheets as back-end relational databases, to give data visibility to business people).
  • Maybe you are using a micro-service architecture and there is just no way to join data between two different databases. Business people have started downloading whole tables to spreadsheets and joining them using VLOOKUP.
  • Different teams have started aggregating their data in different ways. There are multiple important dashboards in Data Studio and Power BI that break often and that no one really owns.
  • You are using multiple third-party solutions for Customer Relations, Navigation Tracking, and Marketing Attribution, and none of these tools talk to each other.

If it takes days to answer questions that should be simple, and spreadsheets are starting to break due to the number of rows, it is probably time to consider building your own Data Lake.

I won’t focus on what a Data Lake is and why it is a good solution for organizing your data in this article, as there are plenty of relevant resources out there. I’ll just summarize some of my favorites here:

Building Data Lakes in startups is hard

“Data Pipeline” by xkcd

In the startup environment, prioritization is a constant effort. You just cannot afford to assign a two-developer team (which at this point may represent a considerable percentage of your whole tech team) to building your data infrastructure, and keep them away from the Product for six months.

There are multiple resources out there on how to build your own Data Lake, but I feel none of them really consider the startup reality. That is the challenge we had at Pier: to build a Data Lake that is scalable, maintainable, fast to build, and cheap. Sounds easy, right?

It is usually a good strategy to break those big challenges into smaller problems, so let us do that exercise. In the end, what are the modules a good Data Lake needs?

  • Storage: data has to be stored somewhere.
  • Governance: we have to be able to tell Raw data apart from Curated data.
  • Ingestion Engine: data has to show up inside the Data Lake somehow.
  • Query Engine: there has to be some way to query the data.
  • Transformation Engine: there has to be some way to transform the data.
  • Orchestration: how do we know a transformation is ready to run? Data pipelines are heavily dependent on orchestration.

We’ll dive deeper into each one of these categories, and I’ll show what Pier’s solution is as we go.

Before we dive in, though, try to keep in mind that we need a solution that is scalable, easy to maintain, and relatively cheap. So, unless your business handles tons of data (like tens of TBs), “as a Service” (aaS) solutions are probably the best strategy.

With aaS solutions, you can just focus on adapting tools that already work to your data — and not on creating super-specific tooling. By following this aaS strategy, we get a scalable and easy-to-build solution (but not necessarily cheap: the price tag will pretty much depend on your data volume). At the end of this article, I’ll give you some insight into how much Pier spends on its Data Infrastructure monthly.

Storage

Data Lakes are usually kept in distributed object stores, where each store key points to a file. This is different from traditional Data Warehouses, where the Relational Database itself takes care of storage formats.

There are many Storage solutions out there:

At this point it may sound like all storage options are pretty similar, and in fact they are. But what we are choosing here is much broader than a Simple Storage Service (see what I did there? 😉): we are choosing a Cloud Solution that we’re going to stick with, for a few years at least.

I strongly advise choosing a single Cloud Solution for your whole Data Lake, top to bottom. As we are going for as much of an “as a Service” approach as possible, that is just smart — tools will integrate better with one another and your life will be easier down the line.

So what we are doing here is not choosing S3 vs GCS: we are choosing AWS vs Google, for the whole Data Lake solution. It is probably smart just to keep the same Cloud Solution your company’s tech stack is already using.

Pier’s General Data Landscape

Here at Pier, our tech stack was already AWS-based — so it made sense for us to also build our Data Infrastructure on top of AWS’s services.

If you are one of the hipsters out there who do not trust Cloud Solutions at all and want to build the whole thing on open source tools… Well, good luck :)

Governance

This is the point where things start to get tricky. You should not just dump data anywhere inside the Storage — that might work when you have 5 tables to look at, but these things scale quickly. Pier, for example, is a 3-year-old business with around 800 tables inside our Data Lake.

So, how do you organize that?

Well, there is no one-size-fits-all solution. We just followed DZone’s proposal and created 4 different buckets in S3, one for each zone (there is a sketch of that layout right after the list below). To summarize it here:

  • Transient: where data is ingested into. This zone is (and is supposed to be) kind of messy. Data comes in from different sources and in multiple formats, sometimes even unstructured.
  • Raw: just as messy as transient. At the raw zone, however, everything should be SQL-readable and in a standardized file format.
  • Trusted: where things start to get more comprehensible. Remove test data, perform Data Quality tests, use better file formats — here at Pier, we use parquet.
  • Refined: raw schemas are not usually in the best shape for analysis. The refined zone is where we keep Curated data, ready for analysis.
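
For illustration only, a sketch of what that layout could look like, with hypothetical bucket names (Pier's real ones are different):

    # Hypothetical bucket layout, one bucket per zone (names are made up)
    TRANSIENT = "s3://acme-datalake-transient/"  # messy landing area for ingested data
    RAW = "s3://acme-datalake-raw/"              # SQL-readable, standardized file format
    TRUSTED = "s3://acme-datalake-trusted/"      # quality-checked data, stored as parquet
    REFINED = "s3://acme-datalake-refined/"      # curated schemas, ready for analysis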

In the beginning, you should just worry about automating everything up to the Raw zone. Creating Trusted and Refined datasets is very iterative and business-rule-dependent work that takes months or even years to reach an acceptable level of maturity.

If all your company’s data is in the raw zone and SQL-readable, you have already gone a long way in organizing your data landscape. You can probably make do with just that for the next couple of investment rounds, before the need for really Curated data starts showing up.

Ingestion Engine

Data comes in many flavors, and you need some way to ingest that into the Data Lake. There are basically three different kinds of data sources:

  • Production databases / queues: the databases or queues used by your back-end. That is usually where the juiciest transactional data resides, and it is also the data business people have the most trouble accessing.
  • Third-party solutions data: you probably use a lot of third-parties that keep relevant data for your business. Some examples: all your navigation data is in Google Analytics, chat messages in ZenDesk, marketing emails in Mailchimp.
  • Static data sources, such as open Census data or some specific research your company has bought.

Not surprisingly, those three kinds of data are usually ingested in different ways. There is no single solution for ingestion, and I personally have seen some pretty robust (and complicated) efforts out there. At Pier, as always, we try to keep things simple, and then live with some minor caveats.

For production databases: our back-end uses a microservice architecture, which means there are multiple different databases we need to ingest. However, all of those databases are PostgreSQL. So by creating a single ingestion engine for PostgreSQL, we can generalize ingestion for every microservice database.

The way we do it today is by running a pg_dump on every database, and saving the resulting text files in the transient zone in the Storage layer. Later on, we parse those text files (as they are all created by pg_dump, a single parser engine can parse every file) and re-create those tables in the raw zone (dropping the old tables that were already there).

This way, our raw zone always has a perfect copy of whatever is inside the production relational databases. The caveat we live with is that, when UPDATEs are performed on those databases, we also lose the previous values in the Data Lake. This has not been a big problem for us, and we live happily with it.
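
As a rough sketch of that first step (connection strings and bucket names here are placeholders, not our actual setup), the dump-and-upload part boils down to something like this:

    import subprocess

    import boto3

    def ingest_service_db(db_url: str, service_name: str) -> None:
        """Dump one microservice database and land the file in the transient zone."""
        dump_path = f"/tmp/{service_name}.sql"
        # Plain-format pg_dump output, which the parser later turns into .jsonl
        subprocess.run(["pg_dump", db_url, "--no-owner", "-f", dump_path], check=True)
        boto3.client("s3").upload_file(
            dump_path,                            # local dump file
            "acme-datalake-transient",            # hypothetical transient-zone bucket
            f"postgres/{service_name}/dump.sql",  # key inside the zone
        )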

The really complicated part here is creating that pg_dump parser. We created a Python parser that reads those text files and generates two outputs:

  • A .jsonl file, which can later be read by our query engine. That is the file saved in our raw zone.
  • The table metadata (columns, types, and S3 file locations), which is also fed to the query engine’s metastore.

This is one of the most complicated parts in our pipeline, and it took around two weeks for a Senior Developer to build.
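
To give an idea of what that parser does, here is a heavily simplified sketch (the real one also extracts column types and file locations for the metastore, handles escaping, and so on). It walks through a plain-format dump and turns each COPY block into a .jsonl file:

    import json
    import re

    COPY_RE = re.compile(r"^COPY (\S+) \((.*)\) FROM stdin;$")

    def dump_to_jsonl(dump_path: str, output_dir: str) -> None:
        """Convert every COPY block in a plain-format pg_dump file to a .jsonl file."""
        table, columns, out = None, None, None
        with open(dump_path) as dump:
            for line in dump:
                line = line.rstrip("\n")
                if table is None:
                    match = COPY_RE.match(line)
                    if match:  # start of a COPY block: grab table name and columns
                        table = match.group(1)
                        columns = [c.strip().strip('"') for c in match.group(2).split(",")]
                        out = open(f"{output_dir}/{table}.jsonl", "w")
                elif line == "\\.":  # end-of-data marker closes the current block
                    out.close()
                    table, columns, out = None, None, None
                else:  # data row: tab-separated values, \N meaning NULL
                    values = [None if v == "\\N" else v for v in line.split("\t")]
                    out.write(json.dumps(dict(zip(columns, values))) + "\n")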

For third-party solutions data, we use aaS extraction. There are basically two big players in the market that provide that service: StitchData and FiveTran. We went with Stitch and are pretty happy with it — they are reliable, cheap, and their support is responsive. We just feed them our services’ credentials, and data magically shows up in our transient zone in S3.

For static data sources, as they usually only need to be ingested once, we do not create automations. We just use AWS Data Wrangler and upload the data source in parquet directly to the raw zone.
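
For reference, such a one-off upload with AWS Data Wrangler is only a few lines; the file, bucket, database and table names below are made up:

    import awswrangler as wr
    import pandas as pd

    # Read the static source locally and write it straight to the raw zone,
    # registering the table in the Glue catalog along the way
    census = pd.read_csv("census_2022.csv")
    wr.s3.to_parquet(
        df=census,
        path="s3://acme-datalake-raw/census_2022/",  # hypothetical raw-zone path
        dataset=True,
        database="raw",        # Glue database
        table="census_2022",   # Glue table
    )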

Query Engine

There are many options out there, and we went with AWS Athena.

Athena is pretty much Presto as a Service. Presto is a powerful Big Data query engine created by Facebook, which is capable of reading files in distributed filesystems using SQL. An important thing to notice is that Presto does not store files, just reads them — hence, it is not a database.

Keeping a Presto cluster, however, is far from trivial. Athena allows us to take advantage of Presto’s powerful features without worrying about deploying, scaling, or maintaining a multi-machine cluster.

The way it works is:

  • You just input your SQL text string, and Athena returns a table with the response. AWS bills you for the amount of data scanned to run the query.
  • Presto works in tandem with the Hive Metastore: Presto reads the files and executes the query, while the Metastore keeps the information needed for a query to run: which files a table’s data consists of, table schemas, and column types. In Athena’s case, AWS Glue plays the role of the Hive Metastore.
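
In practice, querying the lake from Python then becomes a one-liner with AWS Data Wrangler (the table and database names below are just illustrative):

    import awswrangler as wr

    # Athena scans the files in S3 and returns a pandas DataFrame;
    # Glue tells it which files and columns make up the "users" table
    df = wr.athena.read_sql_query(
        "SELECT status, COUNT(*) AS total FROM users GROUP BY status",
        database="raw",
    )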

Considering our company size (around 130 people) and data volume (tens of GB), Athena costs are pretty reasonable — around 300 USD/month. Considering we have zero maintenance costs on our query engine, we consider this a real bargain.

Transformation Engine

This is one of Athena’s coolest features: it is not only a powerful query engine, it is also a transformation engine.

By using Athena’s CREATE TABLE AS SELECT (CTAS) statements, we can perform virtually any transformation we want on our data using just SQL. We version those transformations in a git repository, and have complete control over our data architecture.
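
A minimal sketch of what one of those transformations looks like when fired from Python with boto3 (table names and S3 locations are illustrative, not our real ones):

    import boto3

    CTAS = """
    CREATE TABLE refined.active_users
    WITH (format = 'PARQUET',
          external_location = 's3://acme-datalake-refined/active_users/') AS
    SELECT id, email, created_at
    FROM raw.users
    WHERE status = 'active'
    """

    # Athena runs the SELECT, writes the result as parquet files to the
    # refined zone and registers the new table in Glue
    boto3.client("athena").start_query_execution(
        QueryString=CTAS,
        ResultConfiguration={"OutputLocation": "s3://acme-athena-results/"},
    )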

Orchestration

When you first face the problem of orchestrating ingestion and transformations, it may seem like a good idea to just create some simple scripts that run based on cron-jobs, and run them once or twice a day.

That might work for the first few weeks, but trust me: orchestration scales fast. It is not hard to find medium-sized companies today struggling to orchestrate hundreds of scripts in a veritable cron-job hell.

There are specialized tools out there to do that for you, and they are really easy to deploy. The most famous one, and the one we use here at Pier as well, is Apache Airflow.

Airflow is pretty simple to use, and the best thing is that everything is defined in Python. So all of our complicated operations can also run inside the Airflow server: parsing pg_dump files, creating tables, updating Glue pointers — all using just our parser and AWS Data Wrangler.
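
A bare-bones version of one of those DAGs might look like the sketch below; dump_databases and parse_dumps_to_raw are hypothetical stand-ins for our real parser and Data Wrangler calls:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    from ingestion import dump_databases, parse_dumps_to_raw  # hypothetical helpers

    with DAG(
        dag_id="postgres_ingestion",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        dump = PythonOperator(task_id="pg_dump_to_transient", python_callable=dump_databases)
        parse = PythonOperator(task_id="parse_dumps_to_raw", python_callable=parse_dumps_to_raw)

        dump >> parse  # only parse after every dump has landed in S3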

Extra

At the end of this whole pipeline, we have a working Data Lake, with data ready to be accessed by anyone who knows SQL, using just an Athena connector.

AWS’s browser interface for Athena, however, is not ideal — that is why people usually set up some data visualization software at the end of the pipeline. In our case, we use Redash, but there are many other good open source options out there, such as Metabase or Superset.

To sum up

Here at Pier we have created a Data Lake stack that is simple, reliable and cheap. This stack is mostly based on as a Service solutions and, in our vision, has the ideal complexity for a startup.

  • Although our stack is based on as a Service tooling, we are actually keeping two self-managed servers: one for Airflow, another for Redash — and we are keeping them like that pretty much for historical reasons (and because they are stable). Nowadays, both could be replaced by aaS alternatives: MWAA is AWS’s managed Airflow offering, and redash.io also sells a managed Redash server.
  • This whole pipeline was put together by a single Senior Developer, and it took about a month to stabilize.
  • Costs for this whole pipeline are around 500 USD / month, considering 300 USD from Athena and 200 USD from the EC2 machines where we host Redash and Airflow. S3, Glue and other minor costs are negligible.
