Data Anniversary: My 3-Year Journey in the Voodoo Data Teams

Manuel Pozo · Published in Voodoo Engineering
12 min read · May 17, 2022

My third jobiversary (if that even exists) fell in March '22, which also marked roughly three years since Voodoo started to leverage (big) data to help every team in their daily decision-making processes. To mark the occasion, I wanted to look back on those 3 years, on all the experiences we’ve had, and on how far we’ve come.

This article focuses on the data journey during these 3 years, both in terms of company organisation and data challenges. It might be easy for me to say this with 20–20 hindsight, but guess what: there were challenges around every corner, and those corners were often pretty damn sharp.

Context

Voodoo was born as a video-game company back in 2013. For quite some time, the core of the business was game development, but the team soon realised the added value they could bring in publishing and marketing throughout the life of those games. This was especially true in the hyper-casual market, and this narrow focus turned out to form the foundations of Voodoo’s success.

Voodoo’s growth has been unstoppable ever since. Our publishing teams have collaborated with more than 7,400 studios (500 of which are currently active) and deal with around 200 potentially launchable games per month. Our marketing teams manage the growth of these games and ensure their scalability and user acquisition with a generous wallet.

Today, Voodoo has launched more than 450 games, garnered more than 6 billion downloads, and has more than 200 million monthly users worldwide.

The publishing and marketing teams represent the very essence of the company: take risks, think big, and iterate fast. In order to make good decisions, these teams have had to be pragmatic and learn from data, through trial and error, and all of that at a scale that only a few companies have to deal with.

Growth of the data teams

The first newcomers were hired to scale the business: game project management to support external studios, game development to create new games, and marketing art to keep the economics of scaling healthy. These teams relied on their business knowledge and simple data analysis. Then innovation based on data-driven decisions became more important, and more complex.

Some of these newcomers moved into data analytics roles as new data hires arrived in the teams. The chart below shows the company’s headcount growth, with an emphasis on the data teams, between January 2013 and April 2022. The top chart (line plot) shows the number of active employees in each internal department, giving a concrete idea of department sizes. The middle chart (stacked plot) shows the cumulative number of active employees, putting each department’s growth in perspective against the others. The bottom chart (area plot) shows the evolution of the ratio of active data employees to total active employees in the company.

These charts show the company’s growth in headcount, in particular compared with the growth of the data teams at Voodoo.

These charts only account for teams around publishing, marketing, and business growth; they exclude other collaborations and company acquisitions, which would bring the total to more than 680 employees as of May 2022.

Voodoo has grown from roughly 60 employees to more than 460 in barely 4 years. When I joined the company back in March 2019, I was the first data engineer, and the company already had around 100 employees eager to improve their decision-making processes with data.

Today, around 28 employees hold data roles, making up around 5% of the total headcount. This matters because these few employees support many business teams in their daily routines and enhance the health of the business: they discover new ways to spot potential in game prototypes launched by the publishing teams, automate user acquisition with machine learning for the marketing teams, and provide insightful A/B-test analyses for the game operations teams. Of course, growth is not proportional across departments, and it naturally brings more requests to the data teams. Hence, building a solid data platform is key to serving these users and stakeholders: it reduces technical operations and increases the time data roles can spend on enhancing the business.

Side note: yes, data teams need to grow too; we are hiring for every data position!

Data for the business

More than 3 years ago, Voodoo’s teams were making decisions based on a myriad of third-party tools.

The beginnings of these teams were very practical and straightforward:

  • ad networks like Facebook offer business platforms to handle marketing campaigns and generate reports;
  • Game Analytics is an out-of-the-box tool to analyse high-level in-game user behavior;
  • AppAnnie provides good insights about competitors and applications in the app stores;
  • analytics tools like MixPanel, Amplitude, or Google Analytics help customise user-behavior tracking and analyse user-base growth.

Any tool was legitimate as long as it did the work quickly and efficiently.

At the end of the day, business decisions were made based on many different platforms and data viewpoints. For instance, the marketing team worked so closely with Acquired, an external data-aggregation platform, that Acquired itself evolved and was almost shaped by Voodoo’s insights. Similar collaborations also happened with Unity, MixPanel, Tenjin, and other platforms.

Data centralisation was key to making more accurate data-driven decisions. The first steps towards this centralisation were taken early on, around 2017, with the automation of certain data reports coming from ad networks. Built by the company’s first software engineers, this was a simple application that pulled daily data and exported it into CSV files, which business managers then imported into spreadsheets for quick analysis.
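For illustration, here is a toy reconstruction of what such an exporter could have looked like. The fetch_daily_stats() client and the column names are assumptions, since the real application is not public:

```python
# Toy sketch of a daily report exporter; fetch_daily_stats() is a
# hypothetical ad-network client returning a list of dicts.
import csv
from datetime import date

def export_daily_csv(network: str, day: date) -> str:
    rows = fetch_daily_stats(network, day)  # hypothetical API client
    path = f"{network}_{day.isoformat()}.csv"
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["campaign", "spend", "installs"])
        writer.writeheader()
        writer.writerows(rows)
    return path  # this file was then imported into spreadsheets
```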

The show went on with more business ideas, more games, more iterations, more marketing tests, more acquisition… and it all ended up generating a lot of data, which challenged the previous process. The beginning of 2019 was marked by the growth of our data teams, which provided the means to centralise data more conveniently. The image below shows the first baby steps of the data platform.

High-level, simplified architecture of the data platform at Voodoo

The joint effort of all the data teams made it possible to bootstrap a data-lake approach fairly quickly. In a matter of weeks, we deployed an in-house Apache Airflow cluster on Kubernetes to orchestrate data pipelines and automation processes.
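To give an idea of the shape of these pipelines, here is a minimal Airflow DAG sketch (written in the Airflow 2 style); the DAG id, schedule, and collector logic are illustrative, not an actual Voodoo pipeline:

```python
# Minimal Airflow DAG sketch; the DAG id, schedule and collector logic
# are illustrative assumptions, not a real Voodoo pipeline.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def collect_ad_network_report(ds: str, **_) -> None:
    """Placeholder collector: 'ds' is Airflow's execution date string."""
    print(f"Collecting ad-network report for {ds}")


with DAG(
    dag_id="ad_network_daily_report",
    start_date=datetime(2019, 3, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="collect_report",
        python_callable=collect_ad_network_report,
    )
```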

We decided to code our own data collectors instead of using data-integration tools like FiveTran or Stitch because we expected a large volume of data traffic. Having full control of our collectors gave us more flexibility to scale our jobs; the downside was the (not so little) time we had to invest in the codebase. All the collected data was stored in AWS S3 in Parquet format, with partitioning suited to the different use cases. To make the data easy for data analysts to query, we used AWS Athena and the AWS Glue Catalog; analysts could then connect Periscope, a data-visualisation tool, to Athena and do SQL analytical magic. Finally, we used AWS SageMaker and other off-the-shelf machine-learning services to give our data scientists autonomy.
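As a sketch of the collector-to-data-lake step, the snippet below writes one day of data as partitioned Parquet to S3. The bucket, layout, and fetch_report() helper are assumptions, and pandas needs the pyarrow and s3fs packages installed to write Parquet to s3:// paths:

```python
# Hedged sketch of landing collected data in the data lake; the bucket
# and fetch_report() helper are hypothetical.
import pandas as pd

def land_report(network: str, day: str) -> None:
    df = pd.DataFrame(fetch_report(network, day))  # hypothetical client
    df["dt"] = day  # partition column, e.g. "2019-06-01"
    df.to_parquet(
        f"s3://example-data-lake/raw/{network}",  # hypothetical bucket
        engine="pyarrow",
        partition_cols=["dt"],  # Hive-style dt=... folders that the
                                # Glue Catalog and Athena can pick up
        index=False,
    )
```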

This approach was far from perfect, but it enabled the business to grow, ask new questions and create new challenges. We iterated on our approach and, step by step, reconfigured the whole architecture.

Our in-house Apache Airflow evolved into AWS Managed Workflows for Apache Airflow (MWAA) plus Kubernetes, which improved the isolation of our Airflow jobs. Some of our data collectors were refactored to enable fresher data and higher data volumes. We added better policies for data lifecycle and cost control on top of our data lake. We also started doing reverse ETL to feed other software departments with data from our data stack. On the data-consumption side, we stopped using Athena for visualisation purposes, as it struggled with the sheer number of queries coming from our visualisation tool; instead, we added AWS Redshift as a data-warehouse layer for better performance. Apart from this, we have also moved away from our initial visualisation tool and migrated to Tableau, which makes user dashboard permissions easier to handle.
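To make the reverse-ETL idea concrete, here is a hedged sketch of one such step: reading an aggregate from Redshift (which is reachable over the PostgreSQL protocol) and pushing it to an operational service. The query, table, endpoint, and credentials are all illustrative, not our actual setup:

```python
# Hedged reverse-ETL sketch; table, endpoint and credentials are
# illustrative placeholders.
import psycopg2  # Redshift speaks the PostgreSQL wire protocol
import requests

def sync_prototype_kpis() -> None:
    conn = psycopg2.connect(
        host="example-cluster.eu-west-1.redshift.amazonaws.com",
        port=5439, dbname="analytics", user="etl", password="<secret>",
    )
    with conn, conn.cursor() as cur:
        cur.execute("SELECT game_id, d1_retention FROM mart.prototype_kpis")
        rows = cur.fetchall()
    for game_id, d1_retention in rows:
        # Push each KPI to a hypothetical publishing-platform API.
        requests.post(
            f"https://publishing.example.com/api/games/{game_id}/kpis",
            json={"d1_retention": float(d1_retention)},
            timeout=10,
        )
```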

All these changes have pushed the data platform towards better data governance. Today, all our data decisions come from this stack thanks to the efforts of all the data teams.

  • Tableau is our main entry point for business intelligence and for deepening our knowledge of our users, for example through analyses of user lifetime value, user engagement…
  • Reverse-ETL data pipelines ensure consistent data sharing and data decisions outside of our principal data stack, for instance for our A/B testing platform, marketing bidding, and publishing iterations on prototypes.

Growth and Engineering internal team organisations

The fast growth of the company was a real challenge, in particular with regard to how to organise the teams. How is it possible to grow fast while keeping the culture of the company? (Remember: take risks, think big, and iterate fast.)

Image from Lucidchart — What is a tiger team

From the point of view of the data teams, there was a trade-off between focusing on a single business project and properly sharing technological knowledge or contributing to a common technical foundation. On the one hand, the data platform and base data models require maintenance, and they are likely to evolve to tackle new challenges; having too many data roles tightly attached to business projects neglected this aspect, and it also made it difficult to share knowledge and discoveries across data positions. On the other hand, when people spent more time improving the data foundations, the business lagged a bit behind, since its data requests took longer to be addressed. This paradigm also challenged the sense of ownership data members felt within the data platform. I will not lie: it was (and still is) difficult for us to handle. We have lived through three major re-organisations:

I) Business-first organisation

Most of our data analysts were better versed in business analytics than in analytics engineering. In the first organisation, they were simply pinned to the business teams, so analytical data roles (data scientists and analysts) could assist these teams on a daily basis. Data engineers were moved to a centralised data-core team, in charge of the data stack, of some analytics-engineering tasks, and of developing deeper and broader data topics like GDPR and data monitoring. This setup led to an increased understanding of the business data, but the communication lines between data teams were not fluid, in spite of regular cross-team meetings and knowledge-sharing sessions.

II) Foundations-first organisation

The second organisation within the growth and engineering teams tried to solve this issue by gathering people together. This mainly affected the data analytics team. Together with the data-core team, they were able to contribute strongly to the data stack by creating new data models using DBT. However, the ghost of the business had too strong a presence in their daily routines. This organisation reduced the communication between business product teams and analytical employees, which had an impact on the analytical support provided to business unit managers. As a consequence, the number of demands and tickets in the analytical backlog started to grow and slowed down new business ideas.

III) Focus-first organisation

We are now living the third re-organisation. Our teams are currently split into data core, data analytics, and data science. These teams are multidisciplinary and very autonomous. They contribute both to business and technical foundations. One key difference is that some members of these teams are cross-team now. This creates bridges for communication with the business and avoids technical isolation. Some of the measures we have set up are:

  • “Super-hero” process: one member of the team is responsible for gathering requests, reporting incidents, and solving quick technical issues. This role changes weekly and ensures that knowledge is shared across the team.
  • Flexibility: members who have a strong need to work and communicate with other business teams can adjust their routines to fit those teams. This gives the business teams confidence in everything we do.
  • Wider data governance: we all contribute to the whole data stack and we keep an eye on the business needs more closely. To ensure that we do not get lost along the way, we are committed to better governance of our data, in terms of its observability, accessibility, cataloging, and lineage.

The next and final section focuses on this very aspect — data governance.

Data governance: increase of ownership and tooling

Business data knowledge was very team-centric during the first team organisations, prior to 2019–2020. These teams were full owners of the analyses and decisions made to enhance their business impact. At some point, this led to diverging KPI definitions; for instance, does user lifetime value take into account all kinds of revenue, or only the revenue stemming from advertising?

For quite a long time, data practitioners at Voodoo were simply ok with “just” delivering and fixing issues as they occurred. Introducing more proactivity and awareness around the whole data process was a big challenge. The arrival of more data employees, especially in analytical roles, pushed towards a Data as a Product paradigm: taking the time to add better data-testing layers, building more insightful data monitoring, setting more meaningful data alerts, improving the pace of data delivery, and involving all final data users in data processes more efficiently.

In many data-driven companies, thinking about good data practices can seem a luxury, but it is a necessary one. There are always trade-offs to make, and the culture of the company has shifted very little at its core: do it, and do it fast. Hence, introducing better data practices into our daily work needed to be as easy as possible, with the lowest possible operational overhead.

The chart below shows some data aspects to consider in any data stack. Many of them are familiar (e.g. data architecture, data modeling, business intelligence), and we could talk about them till the cows come home. Instead, I would suggest focusing on the ones which caused us to hesitate the most.

Good data practices — The data governance index

The first topic was data monitoring, as a way to get a grip on data quality. The first iteration aimed to deliver an overview of data quality as soon as possible. We developed a tool, included by default in all our data pipelines, which measured the volume of data in the data storage. On top of that metric, we plugged in an out-of-the-box anomaly-detection system provided by AWS CloudWatch. But this solution was not flexible enough: data monitoring is much more than volume metrics. We then decided to extend the scope of the project and benchmark Montecarlo. The integration of Montecarlo was very smooth, and data practitioners found it easy to use and well-adapted to our needs.
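As an illustration of that first iteration, a volume metric of this kind can be pushed to CloudWatch from each pipeline and then paired with an alarm on an ANOMALY_DETECTION_BAND metric-math expression; the namespace and dimension below are assumptions, not our actual configuration:

```python
# Hedged sketch of shipping a row-count metric to CloudWatch; the
# namespace and dimension names are illustrative.
import boto3

def report_row_count(dataset: str, row_count: int) -> None:
    boto3.client("cloudwatch").put_metric_data(
        Namespace="DataPlatform/Quality",  # hypothetical namespace
        MetricData=[{
            "MetricName": "RowCount",
            "Dimensions": [{"Name": "Dataset", "Value": dataset}],
            "Value": float(row_count),
            "Unit": "Count",
        }],
    )
```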

The second topic was metadata, with a focus on the data catalog. The approach here was much like the previous one: we started with a simple in-house system, which ended up operationally expensive to maintain. We tried out Amundsen (by Lyft), but since we had started using DBT for data modeling and data transformations, we simply used its catalog and lineage features.

The third and last topic we tackled was data lineage. This one was relatively easy, and essentially free, since Montecarlo has a root-cause-analysis feature based on data lineage. It can also integrate with DBT to enrich the whole process, which suited our needs perfectly.

These tools allowed us to set minimum data standards. There are still challenges to tackle: making our processes for testing and deploying new data models more robust, extending the coverage of our data-catalog practices, and enabling much more flexible data exploration to increase the share of data users by onboarding some product teams.

Conclusions

Rome wasn’t built in a day. Voodoo has proven expertise in the domain of gaming entertainment, and it still has enormous potential to develop it further. During these past 3 years, the data teams have laid solid foundations to support and foster further growth.

The key points that have allowed the data teams to do this are teamwork, agility, and expertise. It is ok to struggle and make mistakes, because what matters most is what you learn from them. These aspects were the cornerstone of every step of our data journey: from data centralisation to data-team organisation, we have made it possible to grow the business, the data stack, and the people.

This journey of growth at Voodoo has seen me grow as well, and I am certain that we have paved the way for maintaining good standards in the years to come.

Thanks to Huw Ryan for helping me out writing this article.
