A FanDuel Recipe for Data Platform Success: Four Principles Everyone Should Implement
Written by Marinos Kintis, Lead Data Engineer
Data platforms have become an integral part of modern companies. They define how a company’s data flows across its domains and the necessary transformations to prepare the data for supporting decision making and business intelligence.
The necessity of such platforms is unquestionable, but the question remains: how can we build data platforms that are successful?
In this post, we will present the lessons learned from building such platforms at FanDuel and focus on four company- and technology-independent principles that made this possible.
Data Platforms at FanDuel
FanDuel has a rich history of data platforms which began in 2013 (feel free to read more in our previous blog post). We are currently in the 4th generation of our batch-processing platform, called Automata, and in the 1st generation of our stream-processing platform, called Adrenaline.
Problems with Previous Platforms
Around the end of 2020, we decided to overhaul our batch-processing platform to accommodate the fast growth of FanDuel. We primarily targeted three problems: first, our platform architecture required each pipeline to be created by a data engineer; second, as a direct consequence, pipeline design was not standardised, which increased maintenance costs; finally, code reuse was difficult and testing costs were high.
What Made our New Platforms Successful
Automata and Adrenaline were built with specific principles that centred around addressing the aforementioned shortcomings. So what did we do differently this time?
Principle 1: Create an extensible, self-serving layer on top of the platform
The first improvement we made was to create an extensible, self-serving layer on top of our platforms. With this layer, any FanDuel-er can self-serve their data needs with a simple configuration file. In Automata, such configuration files enable users to create Airflow DAGs that move data from various sources (e.g. buckets and databases) to several destinations (e.g. Redshift) without any Airflow knowledge. In Adrenaline, such files enable users to self-serve their streaming needs (e.g. creating Kafka topics and schemas, Flink transformations, etc.) without any knowledge of the underlying technologies.
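To make the idea concrete, here is a minimal sketch of how a configuration file could be turned into a pipeline plan. It is not Automata's actual implementation: the configuration keys, the `PipelineSpec` class, and the `parse_config` and `plan_tasks` functions are all hypothetical names chosen for illustration, and the sketch deliberately avoids Airflow so that it runs on its own.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical configuration a user might write. The keys are illustrative
# assumptions, not the real schema used by Automata.
USER_CONFIG = {
    "pipeline": "example_pipeline",
    "schedule": "@daily",
    "source": {"type": "s3", "location": "s3://example-bucket/input/"},
    "destination": {"type": "redshift", "table": "example_schema.example_table"},
}

REQUIRED_KEYS = ("pipeline", "schedule", "source", "destination")


@dataclass(frozen=True)
class PipelineSpec:
    """Validated, platform-side view of a user's configuration."""

    name: str
    schedule: str
    source_type: str
    destination_type: str


def parse_config(config: dict) -> PipelineSpec:
    """Validate the user's configuration and convert it into a spec."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return PipelineSpec(
        name=config["pipeline"],
        schedule=config["schedule"],
        source_type=config["source"]["type"],
        destination_type=config["destination"]["type"],
    )


def plan_tasks(spec: PipelineSpec) -> list[str]:
    """Return the ordered task names a platform could generate for the spec."""
    return [
        f"extract_from_{spec.source_type}",
        f"load_into_{spec.destination_type}",
    ]
```

In a real platform, the planned tasks would be mapped to orchestrator constructs such as Airflow operators. The point of the sketch is that the user only ever touches the configuration.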
Self-serving extensibility
Although our self-serving layer covers most of our users’ data needs, we further extended the platform’s functionality by adding extension points that allowed user-owned code to be included in its normal operation. This empowers our users by giving them the ability to plug in their own code to achieve an action that is not already covered by the self-serving mechanisms.
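One common way to offer such extension points is a small plugin registry, sketched below. This is an assumption about how it could be done rather than a description of our platforms: `register_extension`, `run_step`, and the `uppercase_country` example are hypothetical names.

```python
from typing import Callable, Dict, List

Record = dict
Extension = Callable[[Record], Record]

# Registry of user-owned extensions, keyed by the name used in configuration.
_EXTENSIONS: Dict[str, Extension] = {}


def register_extension(name: str) -> Callable[[Extension], Extension]:
    """Decorator that registers a user-owned function under a given name."""

    def decorator(func: Extension) -> Extension:
        if name in _EXTENSIONS:
            raise ValueError(f"extension already registered: {name}")
        _EXTENSIONS[name] = func
        return func

    return decorator


# User-owned code: the platform knows nothing about it beyond its name.
@register_extension("uppercase_country")
def uppercase_country(record: Record) -> Record:
    return {**record, "country": record["country"].upper()}


def run_step(records: List[Record], extension_names: List[str]) -> List[Record]:
    """Apply the named extensions, in order, to every record."""
    for name in extension_names:
        extension = _EXTENSIONS[name]
        records = [extension(record) for record in records]
    return records
```

A user could then reference `uppercase_country` by name from their configuration file, and the platform would call it as part of its normal operation.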
Principle 2: Standardise the design of data pipelines
Previously, different pipelines written by different engineers followed slightly different software engineering designs. We decided to reduce this variability by standardising pipeline design. We wanted the new design to reflect the following attributes:
- A pipeline should have one and only one reason to change (a slightly adapted version of the Single Responsibility Principle)
- Our pipelines should be modular and composable
Applying these attributes to our new platforms, we adopted a design that enforces a single source and a single destination for each pipeline (more details will be provided in our next blog post).
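The single-source, single-destination constraint is what makes pipelines composable: one pipeline's destination can serve as the next pipeline's source. The sketch below illustrates that idea under stated assumptions. The `Pipeline` class and `chain` function are hypothetical and do not reflect the actual design of our platforms.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Pipeline:
    """A pipeline with exactly one source and exactly one destination."""

    name: str
    source: str
    destination: str


def chain(pipelines: List[Pipeline]) -> List[str]:
    """Check that each pipeline reads what the previous one wrote.

    Returns the pipeline names in execution order, or raises ValueError
    if two neighbouring pipelines do not line up.
    """
    for upstream, downstream in zip(pipelines, pipelines[1:]):
        if upstream.destination != downstream.source:
            raise ValueError(
                f"{downstream.name} reads {downstream.source}, "
                f"but {upstream.name} writes {upstream.destination}"
            )
    return [pipeline.name for pipeline in pipelines]
```

Because each pipeline has only one input and one output, a change to a source or a destination affects exactly one pipeline, which matches the "one reason to change" attribute above.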
Principle 3: Treat the platform as a software project
This principle played a key role in the success of our platforms because it:
- Lowered our maintenance costs
- Sped up new feature development
- Kept our data platform engineers happy
There are several books about software engineering best practices; two must-reads are Clean Code and TDD by Example. We found the following practices very useful:
- Document your architectural decisions: this will add structure to the decision process and serve as documentation for the future
- Adopt Infrastructure as Code: this will enable painless provisioning of the platform’s infrastructure; it’s important to design the code so you can provision the same infrastructure in multiple environments (e.g., a PreProd or QA environment)
- Adopt Continuous Integration (and Continuous Delivery where possible): it will make you think early on about how the platform’s code will be deployed and how it fits in your company’s wider ecosystem
- Invest in good code design, tech-debt repayment, and testing practices
Principle 4: Invest in end-to-end tests as soon as possible
Apart from unit and integration tests, it is valuable to focus on end-to-end tests early in the platform’s lifecycle. This enabled us to move fast, with more confidence in our code changes and less manual-testing effort. An added bonus was that running these tests on a clone of our Production environment gave us the ability to test not only the platform’s code but its infrastructure as well.
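As an illustration of what an end-to-end test looks like in miniature, the sketch below runs a toy pipeline from a source database to a destination database and asserts on what arrives at the destination. It uses in-memory SQLite databases as stand-ins for real systems. The `run_pipeline` function and the `events` table are hypothetical, and a real test would run against a clone of the target environment instead.

```python
import sqlite3


def run_pipeline(source: sqlite3.Connection, destination: sqlite3.Connection) -> int:
    """Toy pipeline: copy every row of source.events into destination.events."""
    rows = source.execute("SELECT id, amount FROM events").fetchall()
    destination.execute(
        "CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, amount REAL)"
    )
    destination.executemany("INSERT INTO events (id, amount) VALUES (?, ?)", rows)
    destination.commit()
    return len(rows)


def test_pipeline_end_to_end() -> None:
    # Arrange: seed the source system with known data.
    source = sqlite3.connect(":memory:")
    destination = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, amount REAL)")
    source.executemany("INSERT INTO events VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
    source.commit()

    # Act: run the whole pipeline, not just one of its functions.
    copied = run_pipeline(source, destination)

    # Assert: check only what is observable at the destination.
    loaded = destination.execute(
        "SELECT id, amount FROM events ORDER BY id"
    ).fetchall()
    assert copied == 2
    assert loaded == [(1, 9.5), (2, 20.0)]
```

The test knows nothing about the pipeline's internals, so the implementation can be refactored freely as long as the data that reaches the destination stays the same.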
Recap
In this blog post, we presented four company- and technology-independent principles that helped us create successful data platforms for both our end users and our data engineers. In upcoming blog posts, we will delve into more details about our platforms and their innovative features, so stay tuned!
We are always looking for talented people to join us. If you are interested in working in an environment with a great culture, great benefits, a modern tech stack and serious employee-growth opportunities, don’t forget to check out our careers page.