How do we keep our 70+ data engineers sane?

shashank_singh · Published in FanDuel Life · 6 min read · Oct 4, 2022

Engineers are a unique group of people in society; they yearn to solve the hardest problems, dreaming of automating all of their tasks. For data engineers, this is a monumental undertaking with no clear industry standard to date. Workflow and orchestration engines are one step in that direction, but they are not a complete solution. In this blog post, we share our views on what the journey towards proper workflow automation looks like and the work our data teams at FanDuel are doing to help make it a reality.

Before we begin, let me tell you a story. Imagine it is 2013 and you have been tasked with getting data from point A to point B and throwing in some reports. This is probably the ‘primordial soup’ from which all data journeys start, and it is more or less what happened to us. Being big proponents of open source, we picked Luigi, a workflow engine known, at least at the time, for its ease of use. The name itself pays homage to the ubiquitous game we all played in childhood. True to its name, as our platform grew to adulthood, challenges arrived that Luigi was not up for, and we quickly saw the snag in its armour.
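For readers who never used it, a Luigi pipeline is a set of Python task classes chained together through their requirements. Below is a minimal sketch of that 2013-era ‘point A to point B’ job; the task names and file paths are invented for illustration, not our actual code.

```python
# A minimal Luigi sketch of a "point A to point B" job. Task names and
# file paths are invented for illustration, not FanDuel's actual code.
import datetime

import luigi


class ExtractOrders(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        # Luigi marks a task complete when its output target exists.
        return luigi.LocalTarget(f"data/raw/orders_{self.date}.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("order_id,amount\n1,9.99\n")  # stand-in for a real extract


class LoadOrders(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        # Declaring the upstream task is how Luigi builds the dependency graph.
        return ExtractOrders(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"data/clean/orders_{self.date}.csv")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read())  # "point B": in reality, a warehouse load


if __name__ == "__main__":
    luigi.build([LoadOrders(date=datetime.date(2013, 1, 1))], local_scheduler=True)
```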

Like every other engineering team, there comes a time when you have to introspect.

This was the first time FanDuel data engineers had to reflect and answer the question ‘What’s next?’. The outcome was ERIE, the fondly named Airflow-based orchestration tool we chose. ERIE was everything Luigi was not: faster, leaner, better and more scalable, while coming out of the box with a level of automation that Luigi users could only dream of.
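For context, an Airflow-based tool like ERIE expresses a pipeline as a DAG of operators written in Python. The sketch below is a generic example of that shape; the DAG id, schedule and callables are invented for illustration, not ERIE’s actual code.

```python
# A generic Airflow 2.x DAG sketch. The DAG id, schedule and callables
# are invented for illustration; this is not ERIE's actual code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pull data from the source")


def load():
    print("push data into the warehouse")


with DAG(
    dag_id="orders_daily",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # Airflow builds the dependency graph from this
```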

We had learned from our experience with Luigi, and this time round we made sure that:

  • The code was cleaner and well-structured.
  • Libraries were created with the intentional purpose of sharing code between modules.
  • Alembic was used to manage migrations, pushing a clear message of “self-serve” when it came to data warehouse migrations (see the sketch after this list).
  • Terraform, our chosen tool for Infrastructure-as-Code, played a crucial role in enforcing standardization across the codebase.
  • Python 3 was supported, with only limited Python 2 support retained, to incentivize users to move to a “brighter future”.
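As a taste of what “self-serve” migrations look like in practice, here is a minimal Alembic-style migration script. The revision ids, table and column names are invented for illustration.

```python
# A hedged sketch of a "self-serve" Alembic migration: an engineer commits
# a versioned script like this, and deployment runs `alembic upgrade head`.
# Revision ids, table and column names are invented for illustration.
from alembic import op
import sqlalchemy as sa

revision = "a1b2c3d4e5f6"  # generated by `alembic revision -m "add orders"`
down_revision = None  # None marks this as the first migration in the chain


def upgrade():
    op.create_table(
        "orders",
        sa.Column("order_id", sa.BigInteger, primary_key=True),
        sa.Column("amount", sa.Numeric(12, 2), nullable=False),
        sa.Column("created_at", sa.DateTime, nullable=False),
    )


def downgrade():
    op.drop_table("orders")
```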

To proclaim that ERIE was modernisation with a purpose is not an overstatement. It freed us from Luigi’s reliance on technologies that, by 2018, were outdated. ERIE, with Airflow at its core, even with all its bugs, still provided a far better experience than Luigi, which, in the end, is a crontab-based job scheduler.

Once we had tasted success with that first “N+1” jump, we were always on the lookout for the inevitable ‘snag in the armour’. We didn’t have to wait too long before asking the question ‘What’s next?’ again.

Eventually, we realised that all that expressiveness came at a cost. There was no standardisation across pipelines to leverage during a feature release; shipping even a common product feature to every pipeline took longer with each new pipeline we added. We foresaw a future where we would never be agile enough to introduce features on agreed-upon timelines.

Where are we?

Automata is the next phase of our data journey. We took the “self-serve” principles we learnt from ERIE and mixed them with a better developer experience by creating an easy-to-maintain pipeline factory. This empowered anyone and everyone to contribute a pipeline in the form of an easy-to-understand TOML file instead of esoteric Python. As a first iteration it delivered value on user experience, but it lacked infrastructure scalability.
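To make the idea concrete, here is a hedged sketch of what a declarative pipeline definition plus a tiny “factory” could look like. The TOML schema, keys and step names below are invented for illustration; Automata’s real schema is internal to FanDuel.

```python
# A hedged sketch of the "pipeline factory" idea. The TOML schema, keys
# and step names are invented for illustration; Automata's real schema
# is internal to FanDuel.
import tomllib  # stdlib in Python 3.11+; the `tomli` package backports it

PIPELINE_TOML = """
[pipeline]
name = "orders_daily"
schedule = "0 6 * * *"

[source]
type = "postgres"
query = "SELECT * FROM orders WHERE updated_at >= :watermark"

[transforms]
steps = ["drop_duplicates", "mask_pii"]

[sink]
type = "s3"
bucket = "analytics-landing"
"""


def build_pipeline(config: dict) -> None:
    """Turn a declarative definition into a concrete pipeline plan."""
    source = config["source"]["type"]
    steps = config.get("transforms", {}).get("steps", [])
    sink = config["sink"]["type"]
    plan = " -> ".join([source, *steps, sink])
    print(f"{config['pipeline']['name']}: {plan}")


build_pipeline(tomllib.loads(PIPELINE_TOML))
# orders_daily: postgres -> drop_duplicates -> mask_pii -> s3
```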

The current avatar of our workflow engine, Automata V2, runs on event-driven infrastructure powered by Kubernetes, Celery, Postgres and Redis. It builds on its predecessor’s strong suit: standardisation of workflows through a domain-specific language (TOML).
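As a rough illustration of that stack, the sketch below wires a Celery app to a Redis broker and a Postgres result backend, the standard pattern for this combination. The connection URLs and the task itself are invented for illustration.

```python
# A hedged sketch of an event-driven worker in this stack: Celery tasks
# brokered through Redis, with task state kept in Postgres. Connection
# URLs and the task itself are invented for illustration.
from celery import Celery

app = Celery(
    "automata",
    broker="redis://redis:6379/0",  # the queue that events flow through
    backend="db+postgresql://user:pass@postgres/automata",  # task results
)


@app.task
def run_pipeline(pipeline_name: str) -> str:
    # In practice this would load the TOML definition for `pipeline_name`
    # and execute its steps; here it only echoes the name.
    return f"ran {pipeline_name}"


# Any producer in the cluster can enqueue work without knowing which
# Kubernetes pod's worker will pick it up:
# run_pipeline.delay("orders_daily")
```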

We refocused our products around business use cases. Our “Data Factory” module from Automata was great for 95% of use cases, such as:

  • Batch processing
  • Cleaning and Filtering data
  • Reporting
  • Moving data between databases, messaging brokers and file sources.

The remaining 5% of use cases, which were not generalised enough, were covered by features like:

  • Custom Source (see the sketch below)
  • Maintenance DAGs

This enabled users to make the right call on what fit their needs.
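To give a flavour of the “Custom Source” escape hatch, here is a hedged sketch of the kind of small Python contract a user might implement when the declarative TOML path is not general enough. The interface is invented for illustration; Automata’s real contract is internal.

```python
# A hedged sketch of a "Custom Source" escape hatch. The interface below
# is invented for illustration; Automata's real contract is internal.
from abc import ABC, abstractmethod
from typing import Iterator


class Source(ABC):
    """Minimal contract a custom source might satisfy."""

    @abstractmethod
    def read(self) -> Iterator[dict]:
        """Yield records for downstream transform and sink steps."""


class PartnerApiSource(Source):
    """A bespoke source too specific for the declarative TOML path."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint

    def read(self) -> Iterator[dict]:
        # Stand-in for pagination, retries, authentication, etc.
        yield {"endpoint": self.endpoint, "payload": "..."}
```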

With organizational growth, teams restructured and adapted themselves to this new challenge. For every system review, we started questioning ourselves along the lines of “build vs buy”, specifically to avoid the not-invented-here mindset.

Reorganising along verticals, into smaller self-sufficient units that each own a vertical, helped us become more agile. These changes have reduced the cross-team coordination required in day-to-day activities and strengthened the “self-serve” mindset. In the long run, we are moving much faster due to these changes while delivering on key goals.

Where are we going?

Adrenaline

Every “N+1” jump unlocks a business case that had never been imagined before; it opens opportunities to place bigger bets at the organisation level. That is exactly where we find ourselves now.

Once we got a taste of what is possible with “fast enough data”, processing near real-time data became a possibility. We used that knowledge to build the next generation of our domain-specific streaming product, nicknamed “Adrenaline”, which works in conjunction with Automata and puts a whole buffet of solutions for FanDuel’s data needs within reach of a few keystrokes.

Automata as ‘More than ETL’

Automata is now viewed as more than just an ETL tool: it is an in-house managed solution that, with a few TOML files, provides a scalable and highly available logic processor able to talk to a plethora of services, including APIs, databases and cloud services.

Some of the use cases we have encountered in the wild:

  • Automating the manual process of cleaning, filtering and moving data from one cloud storage to another.
  • Reducing repetitive manual work across the organisation, allowing teams to focus on more complex problems.
  • Ingesting real-time data in micro-batches from across the organisation and processing it (see the sketch after this list).
  • Powering processes that generate revenue directly, turning the platform into a profit centre instead of a cost centre.
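For the micro-batch ingestion pattern mentioned above, here is a minimal, generic sketch: drain whatever has arrived on a queue, hand it over as one small batch, repeat. The in-process queue is a stand-in for a real messaging broker.

```python
# A generic sketch of micro-batch ingestion: drain whatever has arrived
# on a queue, hand it over as one small batch, repeat. The in-process
# queue is a stand-in for a real messaging broker.
import queue
import time
from typing import Iterator


def micro_batches(
    q: "queue.Queue[dict]", max_batch: int = 100, interval: float = 5.0
) -> Iterator[list[dict]]:
    """Yield up to `max_batch` events roughly every `interval` seconds."""
    while True:
        deadline = time.monotonic() + interval
        batch: list[dict] = []
        while len(batch) < max_batch:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(q.get(timeout=remaining))
            except queue.Empty:
                break
        if batch:
            yield batch  # downstream: clean, filter, load, report, ...
```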

What’s Next?

We don’t know what the future of the data domain looks like. Nobody does. But it’s a safe bet that this team will continue to be on the lookout for the next ‘snag in the armour’ and to prepare for the next “N+1” jump.

To answer the question that we started this journey with: how do we keep our 70+ data engineers sane?

We:

  • Pre-empt scale problems
  • Automate tasks
  • Maintain a well-structured team setup that keeps adapting to the growing organisation
  • Keep a clear separation of concerns between teams
  • Always ask when to build vs when to buy
  • And, last but not least, build lasting solutions
