Enterprise ETL automation on GCP
Large enterprises never have just one or two data sources. It’s always tens or hundreds of places they need to pull data from, if not thousands. What may have started out as a couple of shell scripts in the early days of a company, may have turned into a crontab of tasks, which morphs into a spider-web of cron-dependencies, and eventually you’ve got program management sniffing down your department’s neck saying “Why does this ETL process keep failing, and costing us money?”. Eventually a program comes into place to clean up your tech debt, and get everything under a single platform.
The question arises to become, “How do we get it all under control?”, and you’re off to investigate different ETL tools. Let’s take a look at what GCP has to offer in the space.
A selection of tools and paradigms
At the core of it, the idea is simple — use a framework to be able to reduce the amount of manual work done. That means no more shell scripts, or snowflake style execution VMs. We have re-usable components, and rapid development tools. All of google’s offering fit the bill in some manner, but they differ in the details.
We have the following managed service options:
- Cloud Composer — open source apache airflow.
- Cloud Data Fusion — open source data integration, transformation, and analytics tool.
- Cloud Dataprep — closed source partner solution for data integration, transformation, cleaning and analytics.
Your choice of which to use will depend on your teams skillset, your companys legal position around data, and your budget.
Airflow is based around python — you develop pipelines by writing python code. Read more here: https://mark-mccracken.medium.com/airflow-dataflow-scalable-secure-and-reliable-data-integration-aac925195a56.
If your team aren’t proficient with python, this might be an instant blocker. When things inevitably go wrong, you need to be able to read documentation, stacktraces, and be comfortable with the basics of writing some python. You also need to be comfortable with kubernetes, as composer deploys worker resources in a kubernetes cluster, so if something goes wrong, you’ll need to be able to dive in and debug — or have someone from your infrastructure or operations team who understands the architecture of airflow and kubernetes to do it for you.
However, if none of the above are an issue, then cloud composer is likely a great option. Writing code in python provides the most flexibility here. Airflow also has a very advanced scheduler, as this was one of the main goals. Backdating a section of data for every Tuesday in January would be a fairly straightforward affair, and although there’s something of a learning curve to it, this can be bliss when you get it right. It offers a lot of flexibility, and great dependency management features, to ensure your ETL process run in the correct order.
Airflow is open source, and there are frequent updates by the community. They’ve recently launched airflow 2.0, which looks like it was not built in the early 2000s — but you might need to be patient for this to land in google’s Cloud Composer. One of the most attractive aspects of airflow, is being able to write any arbitrary python code with minimal fuss, and if you find yourself using the same snippets or areas of code repeatedly, it’s relatively pain free to turn these into simple re-usable operators.
One such example I’ve made use of in the past, was a tableau dataset refresh operator — after your ETL dependencies have completed for the sources required for a tableau workbook, run a small task to refresh the data source in tableau. This allows your BI team to get up-to-date data out to the business as early as possible.
The operator for this is simply a few API calls to tableau’s publicly available API if you’re using the SaaS product, or your tableau server’s API on premise, and it looks like so:
This isn’t anything complex, it’s taking a python script and putting it into a class with an execute method. This won’t be much of a stretch for even beginner python devs, to make their code re-usable. This then allows us to write very natural looking workflows (called Directed Acyclical Graphs, or DAGs):
Google’s Cloud Composer is a solid and modern implementation of the open source tool. It is regularly updated, and pretty secure by default. You pay for the underlying kubernetes resources which you are able to access, the web server running on an app engine in a peered project which you can’t see, and also the backing database which is kept inaccessible — however if you’re very determined, you can connect to it through the kubernetes cluster. But these components are hidden for good reason, you likely don’t want to fiddle with them. Google does a mostly very good job of making sure your platform stays up and running, and that you don’t need to dig into it. Pricing is very reasonable for the resources you are consuming, and amounts to virtually no more than you’d pay for running and managing it yourself using the same components.
Since it’s so heavily based on code, but mostly abstracted around a well documented framework, airflow will leave anyone comfortable with coding in SQL and python happy with their new tool.
Cloud Data Fusion
Data Fusion is based on an open source tool called CDAP, from a company named Cask, which Google acquired in 2018 to bolster their data solutions offerings. The target audience here seems to be enterprise data warehousing teams, who need to ingest a lot of data from different sources, and who want to spend as little time as possible writing code.
The environment is based around a web application to create data pipelines for ingesting data from sources, running them through a series of transformations and analysis steps, and directing the output to your destination sink. For look and feel, it’s like something from the mid-2000s, built for data engineers, by data engineers:
I must admit, I have created apps focused on frontend user experience, and also several tools for data manipulation and reconciliation — the latter somehow did not get much love in the looks department, and the team behind CDAP should be proud, because mine was functional, but considerably more hideous! A story for another article perhaps.
So what’s it like to use? Functional. It gets the job done. It’s laid out easy enough to find what you’re looking for pretty quickly, and there’s a large array of pre-built connectors, transformations, and sinks, to get your data into shape and into position. There’s a data wrangler that is pretty intuitive to work out, an hour or two after playing with it, and you’ll feel comfortable with chopping and changing things. After each job run, you get an analysis of the work done in your jobs, to see summary data information.
There are a few important aspects to the underlying tool CDAP that are interesting — it is open source, so if you have complex business requirements for on-premise infrastructure, this is something you could ask your infrastructure team to roll out there as well. The underlying workers for jobs are Hadoop clusters, or Dataproc on GCP. This means every record is ingested and processed manually, so you can actually get separate feeds from each output task, for successful and failed records, and treat them accordingly, which I think teams experienced in enterprise data warehousing will find very useful. The compute model here is also elastic, but can feel a little slow to get starting a job, compared to other platforms.
In theory, this should be all hands off, but if something goes wrong, you’re going to need some level of insight into hadoop clusters, processes and log messages, to resolve, which might be a painful experience. You can develop new plugins for the platform, but this is done in Java, which may not be what you signed up for when you wanted a “no-code ETL environment”.
There is a scheduler, which might satisfy basic needs, but this isn’t nearly as advanced as airflow’s scheduling and back-filling capability:
Data Fusion comes in different tiers, but any sizeable team will demand the enterprise version, which runs up $3,000 a month minimum — and this does not include the execution environment for your workers, just the service. This sounds like a lot, but it’s likely in line with other enterprise grade tools in the same space.
CDAP offers a data lineage tool. Personally, I feel like data lineage is always a “nice-to-have” for every org, but in practice it’s not something I’ve seen actually used much, when it is in place. The number of times I’ve seen an instance when it would have come in helpful, is worth a lot less than the effort it takes to create and maintain. I’ve found it more likely that two different reports in BI platforms can show numbers that are contradictory, and data lineage tools often may not be able to reach this far down the data pipeline to tell where items might have diverged.
It’s far from sexy, but this might be just the tool for a large enterprise to consolidate large amounts of processes into a common format, and the value proposition could definitely pay off. There aren’t really any serious drawbacks, the tool is very capable.
Now this one is eye-turning! Definitely pitched towards data analysts and miners, this is incredibly useful for profiling your data for quality and sparsity. If you’ve got messy data sources, this looks like the perfect tool to whip it into shape.
This particular tool comes with some major caveats though, so proceed with caution: it isn’t open source, and is provided by Trifacta in partnership with Google, so this will mean talking to your legal team about allowing their platform to see your data. There’s no open source equivalent you can migrate to, or deploy on premise. And unless you’re strictly dealing with GCS, sheets, and BigQuery, additional connectors come at a hefty expense, with a licence model per user. However, if those might all be ok, then you’re in for a treat.
The experience revolves around ingesting data from source and doing a lot of the work in a highly specialised data wrangler pane
The specialised bar at the top give a great overview of the diversity, quality, and sparsity of your data. The manipulation tools are straightforward to use, it feels more like editing a google sheet than working in a clanky database tool. You’re able to take samples from very large datasets to get the gist of what your transformations will look like, and even re-sample with advanced sampling features, like stratified samples! I was thoroughly impressed with the quality of the tool, it can perform joins with a fantastic preview pane, and the manipulation capabilities are on par with some advanced bigquery features. I was genuinely blown away with some of it’s capabilities, like pivots, macros, and data standardisation, and after a few hours, I’d throughly recommend it to a few people I know who like to work in this manner, and are unfortunate enough to have to regularly deal with messy data.
The workers run dataflow jobs to ingest and transform data as needed, and these spin up pretty quickly. You can get job metrics from the dataflow UI page in the GCP console, but also data profiling similar to data fusion, in the tool’s UI:
Trifacta also provides an in-memory solution for datasets of up to 1GB, and it kicks off and executes jobs very quickly! No waiting 2–3 minutes to spin up dataflow job and wait for it to spin down again, just go. Although we’re only talking about minutes, when that’s eliminated, it keeps you in your working flow.
What’s the catch?
- The pricing model is… not cheap. Unless you exclusively use google sheets, GCS, bigquery and local CSVs, you’ll need to shell out for the professional edition to get connectors to other sources, at an eye watering $400 per month, per user. For a team of 5, that’s $25,000 a year, before you’ve even paid to process any data!
- Connector missing? Too bad, because you won’t be developing plugins.
- Did I mention the price? If you want more advanced features like single sign on, you need to go through their Sales department.
What should we use in our team?
These tools all have different approaches and audiences. If you’ve got a team of comfortable developers, Cloud Composer might be a great fit, but there’s still a learning curve to adjust to, as with any framework. Cloud Data Fusion and Dataprep can act in a similar fashion, although you should consider your business constraints before diving too deep.
I wouldn’t be surprised to find enterprises using a mix of Cloud Composer with some of the other tools, as and when they fit with the teams skills. Airflow has operators available to reach into Data Fusion and launch pipelines, likewise for running job groups in Dataprep.
If none of these looked appealing, or weren't end-to-end enough, there are 3rd party businesses waiting to take your money and manage your ETL for you completely, such as FiveTran and tray.io — these platforms offer you data integration from hundreds of sources, all set up in a matter of clicks and minutes. They are extremely helpful and land your data straight into your warehouse in a nice format, but your mileage may vary when it comes to your own custom ETL tools — I’d be surprised if a large enterprise could get by solely relying on these for ETL tools, but they definitely had a value add when I used them.
Alternatively, if all of these were too heavyweight for your team, Google also offers Workflows, a totally serverless approach. If cloud functions are the glue of events, the workflows would be it’s partner in processes and scheduling. It’s a very young product, probably suited to much smaller and more agile teams, but could be what you’re looking for, for very simple use cases.
What I’d recommend above all, is getting some qwiklabs credits for a few members of your team who will be using the platform, get them to test drive Data Fusion and Dataprep for a few days and use cases to get to know the tools, and report back on how useful they will be. Nothing beats some hands on experience to get familiar with things, and make a proper assessment.