Start-up Data Engineering bible: Ingestion (Part 2)
Everything you need to know about data ingestion for data engineers at start-ups and beyond
About me
Hello 👋 I'm Hugo Lu, a Data Engineer who's also worked in Finance and now CEO @ Orchestra. Orchestra is a data release pipeline tool that helps Data Teams release data into production reliably and efficiently. I write about what good looks like in Data.
Introduction
In the last part of Ingestion I covered a few ways of thinking about how to structure your data ingestion. We saw there are two main factors to consider: speed / latency and throughput / volume. There are also considerations around the destination you're using: sending data to a data lake is different to moving it to a data warehouse, for example. In this article, we'll cover some different technical ways of achieving some of these methods, and we'll ignore streaming for now, which deserves its own topic.
Cloud ecosystem data
I'm going to arbitrarily define "cloud ecosystem data" as:
Data that sits in your company's cloud ecosystem in cloud-specific resources and is required for analytical or operational purposes
Examples would include:
- Azure: Table Storage, a hosted Postgres instance, Microsoft SQL Server, Azure Data Lake Storage etc.
- Google Cloud: GCP data lake storage, various hosted SQL options, Bare Metal
- AWS: S3, DynamoDB, hosted Postgres, and more
As data practitioners, we typically want to move data from the above to one of:
- Data lakes: GCP data lake storage, Azure Data Lake Storage, Amazon S3
- Warehouses: Google BigQuery, Azure Synapse, Amazon Redshift
If data is moving within the ecosystem, you shouldn't need to do anything particularly complicated. This is because most of the big cloud providers have already built integrations between their own services to move it. For example:
- You can move data from ADLS to Synapse using Azure Data Factory
- You can move S3 data to Redshift like this
- GCP have some nice replication options
The only thing to bear in mind here is cost. I've never done database replication in GCP or S3, but generally data movement in AWS is pretty cheap. Azure Data Factory scales terribly, however: the pricing model is linear, so 10 operations cost 10x as much as 1 operation. This is not good if you have a latency requirement. If you need to go from running every 30 minutes to every 15 minutes, some proportion of your cost will double, which is very bad. Generally, 2 incremental jobs should be more expensive than 1 job with twice the throughput, but nowhere near 100% more; linear pricing gives you none of that economy of scale.
Generally for cloud ecosystem data, the first port of call should always be to see what the cloud-native recommended solution is. This also has a BIG security advantage: if your cloud infra is secured, it's not compromised by sending your data to someone else's cloud.
Moving data outside the cloud
What if you use Snowflake or lakeFS? How does one get data there?
Well, there can sometimes still be cloud-offered solutions. Often, there will be a destination-specific methodology that's recommended: for example, see this guide on how to load S3 data into Snowflake.
In this example, you would need a workflow that runs the following steps:
- Stages data into a specific S3 location
- Opens a connection to Snowflake, typically using one of their SDKs
- Calls queries to copy the data into Snowflake
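The staged-load steps above can be sketched roughly as follows. The table, stage and path names are invented for illustration, and the connection part is left as a comment since it needs credentials; the SQL itself follows Snowflake's documented COPY INTO syntax.

```python
# Sketch of the staged-load pattern: build the COPY INTO statement that
# loads files from an S3-backed stage. All object names here are made up.

def copy_into_sql(table: str, stage: str, path: str,
                  file_format: str = "(TYPE = CSV)") -> str:
    """Build the COPY INTO statement that loads staged files into a table."""
    return (
        f"COPY INTO {table} "
        f"FROM @{stage}/{path} "
        f"FILE_FORMAT = {file_format}"
    )

sql = copy_into_sql("analytics.raw_orders", "raw_stage", "orders/2024-01-01/")

# With the snowflake-connector-python package installed, the orchestrated
# step would then be roughly:
#
#   import snowflake.connector
#   with snowflake.connector.connect(user=..., account=..., ...) as conn:
#       conn.cursor().execute(sql)
print(sql)
```

The point is that the orchestration tool owns the trigger and the monitoring; the step itself is just a small piece of SQL fired at the warehouse.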
This obviously requires you to use an orchestration tool to trigger and monitor these jobs, so already that's a lot more overhead. This is, however, generally the cheapest way to do things. For Snowflake in particular, doing it this way also allows some seriously nice fine-grained control that can make your data ingestion very cheap (if you're interested in learning more, I would check out this excellent article from John Ryan). I would argue that being able to spin up what is effectively an app or lambda as part of a data release pipeline, governed by an orchestration tool, is a pretty imperative part of data engineering, and definitely required if you're working with BIG Data.
Change Data Capture
Change Data Capture, or "CDC", is a way to track changes in a database table. One can do interesting things with this data, but first it's important to define two more sub-categories of Cloud Ecosystem Data:
Update tables
An update table is one that is frequently updated and has relatively few insert statements
And:
Append-only tables
An append-only table is characterised by 0 updates and only insert statements
Examples of update tables would include a connection or status table (perhaps there is an operation that has its status updated a few times throughout the course of its life, so the row is updated multiple times). An example of an append-only table would be something like an event store or a log of API calls.
CDC is very useful for database replication of update tables, where the history of the table is not necessarily of interest (we only care about the latest view of it). It allows for database replication in near real-time and is relatively cheap to do. Your cloud provider may offer this as a service, but lots of third-party vendors offer it cheaply as well, such as Estuary, Rivery and Qlik (just the first three off the top of my head).
You should not use CDC for append-only tables. Here, the best way is probably the one described before: append-only tables should have an incremental key (typically an auto-incrementing ID or a created_time field) that allows for efficient incremental querying and loading.
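A minimal sketch of that incremental-key pattern, with invented table and column names: remember the highest key value you have already loaded (the "watermark") and only pull rows above it on each run.

```python
# Incremental loading off an append-only table. The table and column
# names are hypothetical; in real code you would use bound parameters
# rather than string formatting, to avoid SQL injection.

def incremental_query(table: str, key_column: str, watermark: str) -> str:
    """Build a query that fetches only rows newer than the watermark."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE {key_column} > '{watermark}' "
        f"ORDER BY {key_column}"
    )

# After each successful load you persist the new max(key_column) as the
# next run's watermark, e.g. in a small state table or an object-store file.
q = incremental_query("events", "inserted_at", "2024-01-01T00:00:00")
print(q)
```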
Third party SAAS tool data
This is the funnest of the lot! Mainly because there are so many tools that navigating the space is hard.
Something you'll realise is that most of the valuable data in your organisation is probably already in your cloud ecosystem. This means you should be able to derive lots of insight before getting this far. I define third party SAAS tool data as:
Data that resides in third party SAAS tools external to a cloud ecosystem
Here you have a few options:
Build
Build your own app, your own connectors, as an app service. It's not hard, but it takes time. It also requires you to have a relatively sophisticated orchestration system. For example, if you build an app service that triggers an incremental load when it gets called via HTTP, you need an orchestration service to A) trigger it and B) monitor it. This is fairly advanced for most data engineers at start-ups.
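The core of such a home-built connector is usually just a paginated pull loop. A rough sketch, with the page fetcher injected so the same loop works against any (hypothetical) REST endpoint; in a real app service it would wrap `requests.get` with auth headers:

```python
# Page through a source API until it stops returning rows.

from typing import Callable, Iterator

def extract_all(fetch_page: Callable[[int], list],
                page_size: int = 100) -> Iterator[dict]:
    """Yield every record from a paginated source, one page at a time."""
    page = 0
    while True:
        rows = fetch_page(page)
        if not rows:
            break
        yield from rows
        if len(rows) < page_size:
            break  # a short page means we have hit the end
        page += 1

# Fake source standing in for a SAAS API, purely for illustration:
fake_pages = {0: [{"id": 1}, {"id": 2}], 1: [{"id": 3}]}
records = list(extract_all(lambda p: fake_pages.get(p, []), page_size=2))
```

The HTTP endpoint of the app service then just invokes this loop and writes the records to your lake or warehouse, leaving triggering and monitoring to the orchestrator.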
You could also host something yourself like Airbyte or Singer open source.
Being able to do this is very important because inevitably at some point, there will be some bit of data that lives somewhere random you need to ingest.
Buy
Buy some connectors! A great shout if you ask me, especially for low-volume sources like Salesforce. The market for connectors is getting more competitive, and it's really nice to be able to make ingestion someone else's problem. There is also a clear security benefit to only buying connectors for data that sits outside your cloud ecosystem: you aren't really compromising anything more by letting Hevo or Fivetran or whoever process your data outside your cloud, because if it's coming from [Salesforce] it's already outside your cloud. Contrast this with using these tools to move data within your cloud ecosystem.
One caveat: you really, really need the tool to have a good API. Otherwise, if you don't build the interaction with the tool into your data release pipeline, it just runs on a schedule all the time, and if it fails, you have no way of stopping (possibly very expensive) downstream operations from running. Many of these tools have great UIs but quite bad APIs. Some do not even have an API, which immediately compromises your best practice, e.g. Portable.
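This is what "building the interaction into your pipeline" looks like in practice: before kicking off downstream transformations, the orchestrator polls the vendor's sync until it finishes and fails loudly otherwise. The status values and the getter here are hypothetical; a real version would call the vendor's REST API via `requests`.

```python
# Gate downstream jobs on a vendor sync completing.

import time

def wait_for_sync(get_status, timeout_s: float = 3600,
                  poll_s: float = 0.01) -> None:
    """Block until the sync succeeds; raise if it fails or times out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status == "success":
            return
        if status == "failed":
            raise RuntimeError("sync failed - halt downstream jobs")
        time.sleep(poll_s)
    raise TimeoutError("sync did not finish in time")

# Fake status sequence standing in for the vendor API:
statuses = iter(["running", "running", "success"])
wait_for_sync(lambda: next(statuses))
```

A tool with no API simply cannot be wired into this kind of gate, which is the whole problem.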
You already had it
Some premium versions of products have the connectors built for you. This is the perfect scenario, as it's free and not your problem to maintain. Amplitude -> Snowflake is a great example, but only available if you pay for Amplitude. Many cloud warehouse/lakehouse vendors are also encroaching on ingestion tools' patch and market "Zero ELT" solutions, which is basically a fancy way of saying "if you use these two tools, we'll let you sync data between them for free": for example, Databricks' recent partnership with Snowflake.
Manual stores
Hopefully you never have this problem, but when I worked at JUUL we had it in a BIG way. We needed sales data to know how many JUULs were being sold everywhere, and you can imagine how complicated this gets when you're in thousands of stores across 12 countries (EMEA only) and you have a complex network of distributors. Something many tech companies don't do is buy data from providers like IRI, who are quite behind in terms of technology. There are generally two patterns you should be aware of.
Excel / CSV based
Data exists in Excel files or CSVs. You should get whoever owns the data to drop it into a specific file with a specific path every [day]. You will then probably have a Python function that reads the directory, parses the Excel or CSV, and pushes the information somewhere for further pre-processing. This is similar to the "build your own" approach above. It's technically very hacky, and very flaky, because it relies on a human to drag and drop a file.
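A minimal sketch of that drop-folder function: scan a directory, parse any CSVs found, and hand the rows to the next step. The demo uses a throwaway temp directory so the example is self-contained; in practice this would be a mounted share or cloud bucket that a human drops files into.

```python
# Drop-folder ingestion: read every CSV in a folder into row dicts.

import csv
import tempfile
from pathlib import Path

def ingest_drop_folder(folder: Path) -> list[dict]:
    """Read every CSV in the folder into a list of row dicts."""
    rows: list[dict] = []
    for path in sorted(folder.glob("*.csv")):
        with path.open(newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows

# Demo with a temp directory standing in for the real drop folder:
tmp = Path(tempfile.mkdtemp())
(tmp / "sales_2024-01-01.csv").write_text("store,units\nLondon,12\nParis,7\n")
rows = ingest_drop_folder(tmp)
```

From here the rows go to staging for the same downstream pre-processing as any other source; the fragile part is the human upstream, not the code.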
FTP
Data exists behind a File Transfer Protocol (FTP) folder. You will need to trigger the checking of new data on a schedule or make it event-driven. There is a nice blog on this here. Pray you donât need to do this.
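A sketch of what that scheduled check involves. The pure part, deciding which files are new, is shown runnable; the `ftplib` calls that would wrap it are left as comments since they need a live server, and the host and path are made up.

```python
# Scheduled FTP check: work out which remote files we haven't ingested yet.

def new_files(listing: list[str], already_processed: set[str]) -> list[str]:
    """Return files present on the server that we haven't ingested yet."""
    return sorted(f for f in listing if f not in already_processed)

# On a schedule, the real job would do roughly:
#   from ftplib import FTP
#   with FTP("ftp.example.com") as ftp:          # hypothetical host
#       ftp.login(user=..., passwd=...)
#       listing = ftp.nlst("/drops")
#       for name in new_files(listing, processed):
#           with open(name, "wb") as f:
#               ftp.retrbinary(f"RETR {name}", f.write)
todo = new_files(["a.csv", "b.csv"], {"a.csv"})
```

The set of already-processed files has to be persisted somewhere (a state table, or just renaming/moving remote files after ingestion), which is exactly the kind of bookkeeping that makes this pattern painful.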
Is that it for Batch?
That's more or less it for Batch! We saw there are two types of data you're mainly considering here:
- Data in the cloud ecosystem ("Cloud ecosystem data")
- Data in other tools ("Third Party SAAS Tool data")
Within this, there are also "update tables" and "append-only tables". These different combinations necessitate the use of cloud-hosted solutions, own-built infrastructure, (possibly) CDC technology, or third party ingestion tools. There are also edge cases where people basically need to dump CSVs in a folder and data engineering teams have jobs that run on schedules or events to ingest this data. There are also some nice tricks you can use, like utilising connectors in products you already pay for.
We completely left out event data and data in streams: this will be a separate blog post. The same is true of moving data from data lakes to warehouses. This exercise is highly specific and probably also deserves its own blog post. For example, Databricks have some Databricks-specific tooling that is definitely best practice, but only if you're using Databricks. The same is true of Snowflake with Snowpipe, Snowpipe Streaming and good old insert / copy into statements.
We also could have spoken a lot more about big data challenges. Generally, if you're working in a start-up you hopefully won't have too much data, which means the strategies outlined above will apply to you. Some of them do not scale to petabytes of data, and that should come as no surprise. At that level, the correct approach is generally to build your own in the cloud of your choice, but there is a much greater emphasis on whole-system design (maybe it's better for you to be getting the data event by event instead of replicating a database) and deployment (understanding the computers you need to run a load of Spark is non-trivial, definitely harder than just deploying an app service in Azure). You may also have data across multiple clouds, or even multiple clouds in multiple companies! There are so many variations of batch ingestion / database replication that we've had to stick to the start-up-relevant bits to avoid boiling the ocean 🌊
What did I miss? What else constitutes best practice for start-up data ingestion? HMU on LinkedIn if you'd like to discuss 🙂
References
- GCP Databases: https://cloud.google.com/blog/topics/developers-practitioners/your-google-cloud-database-options-explained
- S3 to Redshift: https://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-data.html
- Azure DLS to Synapse: https://learn.microsoft.com/en-us/azure/data-factory/load-azure-sql-data-warehouse?tabs=data-factory
- Analytics Today blog: https://www.analytics.today/blog/snowflake-data-loading-best-practices-bulk-copy-operations
- Change Data Capture (Wikipedia): https://en.wikipedia.org/wiki/Change_data_capture
- FTP data ingestion: https://www.osmos.io/blog/how-to-automate-ftp-data-ingestion