Start-up Data Engineering bible: Ingestion (Part 2)
Everything you need to know about data ingestion for data engineers at start-ups and beyond
About me
Hello 👋 I'm Hugo Lu, a Data Engineer who's also worked in Finance and now CEO @ Orchestra. Orchestra is a data release pipeline tool that helps Data Teams release data into production reliably and efficiently. I write about what good looks like in Data.
Introduction
In the last part of Ingestion I covered a few ways of thinking about how to structure your data ingestion. We saw there are two main factors to consider: speed / latency and throughput / volume. There are also considerations around the destination you're using: sending data to a data lake is different to moving it to a data warehouse, for example. In this article, we'll cover some different technical ways of achieving some of these methods, and we'll ignore streaming for now, which deserves its own topic.
Cloud ecosystem data
I'm going to arbitrarily define "cloud ecosystem data" as:
Data that sits in your company's cloud ecosystem in cloud-specific resources and is required for analytical or operational purposes
Examples would include:
- Azure: Table Storage, a hosted Postgres instance, Microsoft SQL Server, Azure Data Lake Storage etc.
- Google Cloud: GCP data lake storage, various hosted SQL options, Bare Metal
- AWS: S3, DynamoDB, hosted Postgres, and more
As data practitioners, we typically want to move data from the above to one of:
- Data lakes: GCP data lake storage, Azure Data Lake Storage, Amazon S3
- Warehouses: Google BigQuery, Azure Synapse, Amazon Redshift
If data is moving within the ecosystem, you shouldn't need to do anything particularly complicated. This is because most of the big cloud providers have already built integrations between their own services to move it. For example:
- You can move data from ADLS to Synapse using Azure Data Factory
- You can move S3 data to Redshift like this
- GCP have some nice replication options
The only thing to bear in mind here is cost. I've never done database replication in GCP or S3, but generally data movement in AWS is pretty cheap. Azure Data Factory scales terribly, however: the pricing model is linear, so 10 operations cost 10x as much as 1 operation. This is not good if you have a latency requirement. If you need to go from running every 30 minutes to every 15 minutes, some proportion of your cost will double, which is very bad. Generally, 2 incremental jobs should be more expensive than 1 job with twice the throughput, but nowhere near 100% more; linear pricing gives you none of that economy of scale.
Generally for cloud ecosystem data, the first port of call should always be to see what the cloud-native recommended solution is. This also has a BIG security advantage: if your cloud infra is secured, it's not compromised by sending your data to someone else's cloud.
Moving data outside the cloud
What if you use Snowflake or lakeFS? How does one get data there?
Well, there can sometimes still be cloud-offered solutions. Often, there will be a destination-specific methodology that's recommended: for example, see this guide on how to load S3 data into Snowflake.
In this example, you would need a workflow that runs the following steps:
- Stages data into a specific S3 location
- Opens a connection to Snowflake, typically using one of their SDKs
- Calls queries to copy the data into Snowflake
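The staged-load steps above can be sketched roughly as follows. The table, stage and path names are invented for illustration, and the connection part is left as a comment since it needs credentials; the SQL itself follows Snowflake's documented COPY INTO syntax.

```python
# Sketch of the staged-load pattern: build the COPY INTO statement that
# loads files from an S3-backed stage. All object names here are made up.

def copy_into_sql(table: str, stage: str, path: str,
                  file_format: str = "(TYPE = CSV)") -> str:
    """Build the COPY INTO statement that loads staged files into a table."""
    return (
        f"COPY INTO {table} "
        f"FROM @{stage}/{path} "
        f"FILE_FORMAT = {file_format}"
    )

sql = copy_into_sql("analytics.raw_orders", "raw_stage", "orders/2024-01-01/")

# With the snowflake-connector-python package installed, the orchestrated
# step would then be roughly:
#
#   import snowflake.connector
#   with snowflake.connector.connect(user=..., account=..., ...) as conn:
#       conn.cursor().execute(sql)
print(sql)
```

The point is that the orchestration tool owns the trigger and the monitoring; the step itself is just a small piece of SQL fired at the warehouse.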
This obviously requires you to use an orchestration tool to trigger and monitor these jobs, so already that's a lot more overhead. This is, however, generally the cheapest way to do things. For Snowflake in particular, doing it this way also allows some seriously nice fine-grained control that can make your data ingestion very cheap (if you're interested in learning more, I would check out this excellent article from John Ryan). I would argue that being able to spin up what is effectively an app or lambda as part of a data release pipeline, governed by an orchestration tool, is a pretty imperative part of data engineering, and definitely required if you're working with BIG Data.
Change Data Capture
Change Data Capture, or "CDC", is a way to track changes in a database table. One can do interesting things with this data, but first it's important to define two more sub-categories of Cloud Ecosystem Data:
Update tables
An update table is one that is frequently updated and has relatively few insert statements
And:
Append-only tables
An append-only table is characterised by 0 updates and only insert statements
Examples of update tables would include a connection or status table (perhaps there is an operation that has its status updated a few times throughout the course of its life, so the row is updated multiple times). An example of an append-only table would be something like an event store or a log of API calls.
CDC is very useful for database replication of update tables, where the history of the table is not necessarily of interest (we only care about the latest view of it). It allows for database replication in near real-time and is relatively cheap to do. Your cloud provider may offer this as a service, but lots of third-party vendors offer it cheaply as well, such as Estuary, Rivery and Qlik (just the first three off the top of my head).
You should not use CDC for append-only tables. Here, the best way is probably the one described before: append-only tables should have an incremental key (typically an auto-incrementing ID or a created_time field) that allows for efficient incremental querying and loading.
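A minimal sketch of that incremental-key pattern, with invented table and column names: remember the highest key value you have already loaded (the "watermark") and only pull rows above it on each run.

```python
# Incremental loading off an append-only table. The table and column
# names are hypothetical; in real code you would use bound parameters
# rather than string formatting, to avoid SQL injection.

def incremental_query(table: str, key_column: str, watermark: str) -> str:
    """Build a query that fetches only rows newer than the watermark."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE {key_column} > '{watermark}' "
        f"ORDER BY {key_column}"
    )

# After each successful load you persist the new max(key_column) as the
# next run's watermark, e.g. in a small state table or an object-store file.
q = incremental_query("events", "inserted_at", "2024-01-01T00:00:00")
print(q)
```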
Third party SAAS tool data
This is the funnest of the lot! Mainly because there are so many tools that navigating the space is hard.
Something you'll realise is that most of the valuable data in your organisation is probably already in your cloud ecosystem. This means you should be able to derive lots of insight before getting this far. I define third party SAAS tool data as:
Data that resides in third party SAAS tools external to a cloud ecosystem
Here you have a few options:
Build
Build your own app, your own connectors, as an app service. It's not hard, but it takes time. It also requires you to have a relatively sophisticated orchestration system. For example, if you build an app service that triggers an incremental load when it gets called via HTTP, you need an orchestration service to A) trigger it and B) monitor it. This is fairly advanced for most data engineers at start-ups.
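The core of such a home-built connector is usually just a paginated pull loop. A rough sketch, with the page fetcher injected so the same loop works against any (hypothetical) REST endpoint; in a real app service it would wrap `requests.get` with auth headers:

```python
# Page through a source API until it stops returning rows.

from typing import Callable, Iterator

def extract_all(fetch_page: Callable[[int], list],
                page_size: int = 100) -> Iterator[dict]:
    """Yield every record from a paginated source, one page at a time."""
    page = 0
    while True:
        rows = fetch_page(page)
        if not rows:
            break
        yield from rows
        if len(rows) < page_size:
            break  # a short page means we have hit the end
        page += 1

# Fake source standing in for a SAAS API, purely for illustration:
fake_pages = {0: [{"id": 1}, {"id": 2}], 1: [{"id": 3}]}
records = list(extract_all(lambda p: fake_pages.get(p, []), page_size=2))
```

The HTTP endpoint of the app service then just invokes this loop and writes the records to your lake or warehouse, leaving triggering and monitoring to the orchestrator.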
You could also host something yourself like Airbyte or Singer open source.
Being able to do this is very important because inevitably at some point, there will be some bit of data that lives somewhere random you need to ingest.
Buy
Buy some connectors! A great shout if you ask me, especially for low-volume sources like Salesforce. The market for connectors is getting more competitive, and it's really nice to be able to make ingestion someone else's problem. There is also a clear security benefit to only buying connectors for data that sits outside your cloud ecosystem: you aren't really compromising anything more by letting Hevo or Fivetran or whoever process your data outside your cloud, because if it's coming from [Salesforce] it's already outside your cloud. Contrast this with using these tools to move data within your cloud ecosystem.
One caveat: you really, really need the tool to have a good API. Otherwise, if you don't build the interaction with the tool into your data release pipeline, it just runs on a schedule all the time, and if it fails, you have no way of stopping (possibly very expensive) downstream operations from running. Many of these tools have great UIs but quite bad APIs. Some do not even have an API, which immediately compromises your best practice, e.g. Portable.
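This is what "building the interaction into your pipeline" looks like in practice: before kicking off downstream transformations, the orchestrator polls the vendor's sync until it finishes and fails loudly otherwise. The status values and the getter here are hypothetical; a real version would call the vendor's REST API via `requests`.

```python
# Gate downstream jobs on a vendor sync completing.

import time

def wait_for_sync(get_status, timeout_s: float = 3600,
                  poll_s: float = 0.01) -> None:
    """Block until the sync succeeds; raise if it fails or times out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status()
        if status == "success":
            return
        if status == "failed":
            raise RuntimeError("sync failed - halt downstream jobs")
        time.sleep(poll_s)
    raise TimeoutError("sync did not finish in time")

# Fake status sequence standing in for the vendor API:
statuses = iter(["running", "running", "success"])
wait_for_sync(lambda: next(statuses))
```

A tool with no API simply cannot be wired into this kind of gate, which is the whole problem.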
You already had it
Some premium versions of products have the connectors built for you. This is the perfect scenario, as it's free and not your problem to maintain. Amplitude -> Snowflake is a great example, but only available if you pay for Amplitude. Many cloud warehouse/lakehouse vendors are also encroaching on ingestion tools' patch and market "Zero ELT" solutions, which is basically a fancy way of saying "if you use these two tools, we'll let you sync data between them for free": for example, Databricks' recent partnership with Snowflake.
Manual stores
Hopefully you never have this problem, but when I worked at JUUL we had it in a BIG way. We needed sales data to know how many JUULs were being sold everywhere, and you can imagine how complicated this gets when you're in thousands of stores across 12 countries (EMEA only) and you have a complex network of distributors. Something many tech companies don't do is buy data from providers like IRI, who are quite behind in terms of technology. There are generally two patterns you should be aware of.
Excel / CSV based
Data exists in Excel files or CSVs. You should get whoever owns the data to drop it into a specific file with a specific path every [day]. You will then probably have a Python function that reads the directory, parses the Excel or CSV, and pushes the information somewhere for further pre-processing. This is similar to the "build your own" approach above. It's technically very hacky, and very flaky, because it relies on a human to drag and drop a file.
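A minimal sketch of that drop-folder function: scan a directory, parse any CSVs found, and hand the rows to the next step. The demo uses a throwaway temp directory so the example is self-contained; in practice this would be a mounted share or cloud bucket that a human drops files into.

```python
# Drop-folder ingestion: read every CSV in a folder into row dicts.

import csv
import tempfile
from pathlib import Path

def ingest_drop_folder(folder: Path) -> list[dict]:
    """Read every CSV in the folder into a list of row dicts."""
    rows: list[dict] = []
    for path in sorted(folder.glob("*.csv")):
        with path.open(newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows

# Demo with a temp directory standing in for the real drop folder:
tmp = Path(tempfile.mkdtemp())
(tmp / "sales_2024-01-01.csv").write_text("store,units\nLondon,12\nParis,7\n")
rows = ingest_drop_folder(tmp)
```

From here the rows go to staging for the same downstream pre-processing as any other source; the fragile part is the human upstream, not the code.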
FTP
Data exists behind a File Transfer Protocol (FTP) folder. You will need to trigger the checking of new data on a schedule or make it event-driven. There is a nice blog on this here. Pray you donât need to do this.
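A sketch of what that scheduled check involves. The pure part, deciding which files are new, is shown runnable; the `ftplib` calls that would wrap it are left as comments since they need a live server, and the host and path are made up.

```python
# Scheduled FTP check: work out which remote files we haven't ingested yet.

def new_files(listing: list[str], already_processed: set[str]) -> list[str]:
    """Return files present on the server that we haven't ingested yet."""
    return sorted(f for f in listing if f not in already_processed)

# On a schedule, the real job would do roughly:
#   from ftplib import FTP
#   with FTP("ftp.example.com") as ftp:          # hypothetical host
#       ftp.login(user=..., passwd=...)
#       listing = ftp.nlst("/drops")
#       for name in new_files(listing, processed):
#           with open(name, "wb") as f:
#               ftp.retrbinary(f"RETR {name}", f.write)
todo = new_files(["a.csv", "b.csv"], {"a.csv"})
```

The set of already-processed files has to be persisted somewhere (a state table, or just renaming/moving remote files after ingestion), which is exactly the kind of bookkeeping that makes this pattern painful.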
Is that it for Batch?
That's more or less it for Batch! We saw there are two types of data you're mainly considering here:
- Data in the cloud ecosystem ("Cloud ecosystem data")
- Data in other tools ("Third Party SAAS Tool data")
Within this, there are also "update tables" and "append-only tables". These different combinations necessitate the use of cloud-hosted solutions, own-built infrastructure, (possibly) CDC technology, or third party ingestion tools. There are also edge cases where people basically need to dump CSVs in a folder and data engineering teams have jobs that run on schedules or events to ingest this data. There are also some nice tricks you can use, like utilising connectors in products you already pay for.
We completely left out event data and data in streams: this will be a separate blog post. The same is true of moving data from data lakes to warehouses. This exercise is highly specific and probably also deserves its own blog post. For example, Databricks have some Databricks-specific tooling that is definitely best practice, but only if you're using Databricks. The same is true of Snowflake with Snowpipe, Snowpipe Streaming and good old insert / copy into statements.
We also could have spoken a lot more about big data challenges. Generally, if you're working in a start-up you hopefully won't have too much data, which means the strategies outlined above will apply to you. Some of them do not scale to petabytes of data, and that should come as no surprise. At that level, the correct approach is generally to build your own in the cloud of your choice, but there is a much greater emphasis on whole-system design (maybe it's better for you to be getting the data event by event instead of replicating a database) and deployment (understanding the computers you need to run a load of Spark is non-trivial, definitely harder than just deploying an app service in Azure). You may also have data across multiple clouds, or even multiple clouds in multiple companies! There are so many variations of batch ingestion / database replication that we've had to stick to the start-up-relevant bits to avoid boiling the ocean 🌊
What did I miss? What else constitutes best practice for start-up data ingestion? HMU on LinkedIn if you'd like to discuss 🙂
References
- GCP Databases: https://cloud.google.com/blog/topics/developers-practitioners/your-google-cloud-database-options-explained
- S3 to Redshift: https://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-data.html
- Azure DLS to Synapse: https://learn.microsoft.com/en-us/azure/data-factory/load-azure-sql-data-warehouse?tabs=data-factory
- Analytics Today blog: https://www.analytics.today/blog/snowflake-data-loading-best-practices-bulk-copy-operations
- Change Data Capture (Wikipedia): https://en.wikipedia.org/wiki/Change_data_capture
- FTP data ingestion: https://www.osmos.io/blog/how-to-automate-ftp-data-ingestion