Data movement everywhere

The art of building your own ELT

And why you should (paradoxically) still buy it

6 min read · Aug 16, 2023


If I’d worked at Fivetran for more than 5 years, I’d be laughing.

Fivetran was founded around 2012, over 10 years ago. It was one of the first success stories in the data space, offering out-of-the-box connectors for syncing data from SaaS tools into cloud warehouses. Fivetran is widely heralded as the leader in the “Modern Data Stack” (I think they came up with that term?) and is one of the largest companies in data outside of the big 3 cloud providers and the data warehouse providers.

They claimed an enormous first-mover advantage for what is, fundamentally, a commoditisable solution. As data engineers, we don’t really care much about service levels or bells and whistles; we generally just want to move data from A to B reliably, and have good oversight with whatever tool we use to do it (via API, a nice UI or otherwise).

With the advent of Gen AI, it’s never been easier to build your own. That ought to leave Fivetran quaking in their boots, right?

How is it so easy to build your own AI App?

With a bit of knowledge around how to structure an application, it’s pretty straightforward to build your own version of an ELT app.

There are a few components:

  1. A router; an object that holds all the routes for accepting API requests
  2. A class that defines pulling data from a source, with some basic methods. It’s a bit like an interface in other languages (I’ve got a Python hat on today).
  3. A class that defines pushing data to a sink
  4. Integration services that allow those classes to interact with tools like Salesforce or Hubspot, and Snowflake or BigQuery on the warehouse side
  5. Lots of nice types (things like Pydantic classes) for easy and continuous validation
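The five components above can be sketched in a few lines. This is a hedged, dependency-free illustration: in practice the `Record` type would be a Pydantic model and the source/sink implementations would wrap real API clients; all class and method names here are my own, not from the repo.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable

# In the real app this would be a Pydantic model for continuous validation;
# a dataclass keeps the sketch dependency-free.
@dataclass
class Record:
    source: str
    payload: dict

class Source(ABC):
    """Anything we can pull records from (Salesforce, Hubspot, ...)."""
    @abstractmethod
    def extract(self) -> Iterable[Record]: ...

class Sink(ABC):
    """Anything we can push records to (Snowflake, BigQuery, ...)."""
    @abstractmethod
    def load(self, records: Iterable[Record]) -> int: ...

# Toy implementations so the pipeline runs end to end without credentials.
class InMemorySource(Source):
    def __init__(self, rows):
        self.rows = rows

    def extract(self):
        for row in self.rows:
            yield Record(source="memory", payload=row)

class ListSink(Sink):
    def __init__(self):
        self.stored = []

    def load(self, records):
        self.stored.extend(records)
        return len(self.stored)

def run_pipeline(source: Source, sink: Sink) -> int:
    """Move everything from one source into one sink; return rows loaded."""
    return sink.load(source.extract())
```

A real Salesforce or Hubspot integration is then just another `Source` subclass, and the router's job reduces to wiring a request to `run_pipeline` with the right pair.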

And that’s about it. If you know what 1–5 are, it’s very easy to prompt something as generalised as ChatGPT to give you a pretty good start on writing something decent. You can check out the repo here, feel free to contribute:

There are lots of aspects to this that we should really improve before use, though. These include (but are not limited to):

  • Creating an interface for integrations
  • Creating various interfaces and pydantic models for integration-specific objects e.g. Hubspot Contacts
  • Ensuring there is a proper API structure in place (the routes I chose initially are pretty arbitrary)
  • Having separate routers
  • Tests (obviously)
  • Greater functionality for incremental loading, which could potentially live in another interface
  • Logging
  • Additional key vaults
  • Additional integrations
  • Better docs on CI/CD and deployment — currently you’ll need to know how to deploy a FastAPI app somewhere for this to actually work
  • Authentication
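To illustrate the "separate routers" point, here is a minimal sketch of route composition using a plain dispatch table. In a real FastAPI app you would use `APIRouter` instances mounted with path prefixes; the paths and handlers below are hypothetical stand-ins.

```python
# Hypothetical sketch of "separate routers": one router per concern
# (sources vs. sinks), composed into a single app. In FastAPI this is
# APIRouter(prefix=...) plus app.include_router(...).
class Router:
    def __init__(self, prefix: str = ""):
        self.prefix = prefix
        self.routes = {}

    def add(self, path, handler):
        """Register a handler under this router's prefix."""
        self.routes[self.prefix + path] = handler

    def include(self, other: "Router"):
        """Mount another router's routes into this one."""
        self.routes.update(other.routes)

    def dispatch(self, path, **kwargs):
        return self.routes[path](**kwargs)

# One router per area of the app, instead of one arbitrary flat list.
sources = Router(prefix="/sources")
sources.add("/hubspot/contacts", lambda: {"status": "extract started"})

sinks = Router(prefix="/sinks")
sinks.add("/snowflake/load", lambda table: {"status": f"loading {table}"})

app = Router()
app.include(sources)
app.include(sinks)
```

The payoff is that each integration can own its own router file, and the top-level app stays a short list of `include` calls.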

The point is, once this was all in place you could iterate on it very quickly; it wouldn’t take you long.

Added technical bonus: if you use something like this to deploy a really niche, specific bit of ELT, you can include it in any orchestration pipeline! This is basically a small microservice, so as long as it has an entry point, whatever is running your orchestration (be it Airflow, Prefect or Orchestra) will be able to trigger it and include it as part of a flow. This is important because ELT should trigger transformations downstream, and reverse ELT should depend on the success of those transformations.
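As a sketch of that orchestration hook, here is roughly what a task that triggers the microservice and gates downstream work on its success could look like. The endpoint path and response shape are assumptions, and the HTTP call is injected so the example runs without a live service; in production you would pass something like `requests.post` wrapped to return JSON.

```python
# Sketch of an orchestrator task calling the ELT microservice. The
# /pipelines/run endpoint and response fields are hypothetical.
def trigger_elt(post, base_url: str, source: str, sink: str) -> dict:
    """Trigger a pipeline run; raise so the orchestrator marks the task failed
    and skips downstream transformations."""
    resp = post(f"{base_url}/pipelines/run", json={"source": source, "sink": sink})
    if resp["status"] != "success":
        raise RuntimeError(f"ELT failed: {resp}")
    return resp

# Fake transport standing in for a requests.post(...).json() round trip,
# so the sketch is runnable offline.
def fake_post(url, json):
    return {"status": "success", "rows": 42, "url": url}
```

An Airflow or Prefect task wrapping `trigger_elt` then composes naturally with a downstream dbt or transformation task that only runs on success.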

Does it stop at ELT?

Definitely not, but there are some considerations:

  1. Maintaining this is long and boring

There is a reason people stopped writing Salesforce connectors 7 years ago, and that’s because it’s just so much quicker and cheaper to buy than to build. Even with it being this easy, it’s still quicker to outsource, and it always will be.

2. Ingesting data incrementally can be hard

Incremental loading can be tricky, particularly if you have a Data Lake in your architecture and are streaming data into it to be processed and moved to a warehouse later. Maintaining something like this, and fine-tuning an LLM to know how to work with it, isn’t great.
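For a concrete flavour of the problem, here is a minimal high-watermark approach to incremental extraction. The `updated_at` field and the in-memory state dict are assumptions for the sketch; real state would live in a key-value store or in the warehouse itself, and the hard parts (late-arriving data, deletes, schema drift) are deliberately out of scope.

```python
from datetime import datetime

# Minimal high-watermark incremental load. `state` stands in for durable
# pipeline state; `key` is the assumed last-modified column on each row.
def incremental_extract(rows, state, key="updated_at"):
    """Return only rows newer than the stored watermark, then advance it."""
    watermark = state.get("watermark", datetime.min)
    fresh = [r for r in rows if r[key] > watermark]
    if fresh:
        state["watermark"] = max(r[key] for r in fresh)
    return fresh
```

Even this toy version shows why it gets hairy: the moment a source system backfills a record with an old timestamp, a pure watermark silently misses it.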

3. Streaming architecture

You would ideally use a single interface to manage any and all data ingestion. Streaming requires a different architecture to the one proposed above. I’ve no idea how you’d go about building one; it could be easy, it could be hard. I haven’t given it much thought, since using out-of-the-box services from cloud providers is just so easy.

So in summary: you probably don’t want to build your own ELT unless you have to, but it’s easier than ever. Which brings me onto…

REVERSE ELT

Now here’s something you can have a decent stab at. It’s easier because:

  1. You probably don’t need to stream anything; batch is fine
  2. Data is clean; it’s been cleaned and dumped into Snowflake first. Writing some Python to pull data from Snowflake is very easy indeed (much easier than parameterising merge insert statements, anyway)
  3. Updating source systems normally only takes a single API call i.e. it’s just as easy as GETting data
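Putting points 2 and 3 together, a reverse ELT job is little more than a warehouse query followed by a POST per record. This sketch uses stand-in functions for the Snowflake query and the source-system API call; none of the names come from a real library, and a production version would add batching, retries and error handling.

```python
# Hedged sketch of reverse ELT: read already-clean rows from the warehouse
# and push each one back to a source system with a single API call.
def reverse_elt(fetch_rows, post_record, query: str) -> list:
    """fetch_rows stands in for a Snowflake cursor query; post_record stands
    in for an HTTP POST to e.g. a CRM API."""
    results = []
    for row in fetch_rows(query):
        results.append(post_record(row))
    return results

# Offline stand-ins so the sketch runs without credentials.
def fetch_stub(query):
    # would be: cursor.execute(query).fetchall() against Snowflake
    return [{"email": "a@x.com", "score": 90}, {"email": "b@x.com", "score": 75}]

def post_stub(row):
    # would be: requests.post(crm_url, json=row)
    return {"status": 200, "email": row["email"]}
```

Because the data is already modelled and validated in the warehouse, there is almost nothing to transform here, which is exactly why this direction is the easier stab to take.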

What’s more — think about the functionality you just built for your ELT app…you’ve got sources, you’ve got sinks. But you’ve got a framework where everything is really just…an integration. Snowflake can be a sink. It can also be a source. This is another great example of false dichotomies in data (where you think something either has to be A or B not both A and B — be wary of false dichotomies in data, it’s normally just marketing).

Something I still don’t really understand is why Fivetran has never moved into rELT. You’d think, with sound architecture, it would be trivial to add some POST endpoints to their Salesforce integration and some GET ones to their Snowflake integration.

So you’re in a fun position if you decide to build some ELT, because it’s 80% of the way there to allowing you to push data back into source systems (you built the integrations, now add the endpoints).

Fivetran joy

If I’d been a long-time Fivetran supporter, I would be rejoicing now, for a few reasons:

  1. Although it’s never been easier to build ELT, it’s also never been easier to build rELT. This gives pretty much every ELT vendor a chance to move into the space (just look at Hevo)
  2. The race has already been won — this is fundamentally a market that should look very close to perfect competition, or at least monopolistic competition if you account for differentiation on service and things like enterprise features. Despite this, Fivetran has enjoyed years and years of high fees, and having secured them already, now has some power to maintain them into the future

And this sort of goes for all ELT providers, but only sort of. Those that have been around longer have likely enjoyed periods of supernormal profits, which is obviously nice. They’re also now seeing an opportunity to move into other areas of data movement, and can potentially iterate extremely quickly thanks to Gen AI.

For those who have been around for a shorter period, it’s worrisome, since it’s never been easier to build a lightweight data movement API platform. There will be an enormous amount of competition at the lower end of the market, where vendors cannot afford to develop the bells and whistles, like data residency and SOC 2 certifications, required to unlock larger enterprise contracts.

However, it’s great news for us data engineers.

More competition in supply is exactly what makes those on the purchasing side of services better off. And sure, maybe you’re the one person in 20 who’s actually going to build this yourself and end up with an awesome homegrown microservices setup that costs you $20 a month — in which case this is all irrelevant.

But if you’re not that person don’t fret. ELT vendor competition is coming to a city near you, so buckle up and grab yourself a deal.


I write on Data engineering and the coolest data stuff. CEO@ Orchestra, the best-in-class unified control plane for dataops. https://app.getorchestra.io/signup