Data Platform at Gympass: Building a Scalable and Democratic Data Culture

Published in Wellhub Tech Team (formerly Gympass) · 6 min read · Aug 26, 2019

Since 2018, Gympass has been investing in an even more robust product development team, recruiting and nurturing skilled technology and product staff. This investment is oriented entirely towards a long-term goal of increasing business scalability, supporting our global mission of defeating inactivity as we expand our number of partners and users around the world.

With so many talented people working together, it quickly becomes clear that aiding our employees’ productivity and satisfaction multiplies the outcome of their collaborative effort. This creates the opportunity to form teams whose sole purpose is to build platforms and tools that automate, control and facilitate day-to-day tasks. At Gympass, we have a tribe called Platform Engineering, which houses some squads dedicated to this job. Keep this idea in mind; I’m going to stitch everything together, I promise.

During sprint cycles, product teams iterate using a recurrent scientific approach, evolving the product around observation and experimentation, always trying to optimize user experience metrics given the team’s scope of work.

Product teams should be constantly applying a scientific approach

To do so, everyone in the product teams needs, at some point, to worry about data. Whether you are observing and analyzing data, or developing new features to test hypotheses and collect data, you are participating in this never-ending cycle of scientific methodology.

The thing is, data science, in a broader sense, is not trivial, especially when we are talking about a constantly growing business. Collecting data from millions of users, protecting sensitive information across all data sources, analyzing petabytes of data and reaching meaningful takeaways: all of these require technology. A simple yet important question, whose answer may exist inside this moving ocean of data, may take ages to come to light, depending on the data platform behind it. Therefore, one of the squads composing the Platform Engineering tribe is the Data Platform squad.

The main goal of the Data Platform squad at Gympass is to empower product teams in those scientific iterations, easing the discovery cycle end to end: from data collection to data analysis. Remember that one of our concerns is scalability, so we need to make sure we deliver on this goal with a sustainable, long-term strategy.

We believe that some questions that arise in a product team have really hard answers, which may require better-suited specialists, such as Data Analysts or Data Scientists. However, other questions should be easy for other team members to answer, without having to wait for an expert, which may delay important insights. This approach relies on a learning curve, since people must be taught the tools they need to answer those questions, on a supporting platform and, more importantly, on a widespread data culture. That said, it is a highly scalable strategy and it creates a very data- and metrics-oriented mindset.

With this in mind, we have a strategy which stands upon three main pillars: pipeline automation, data democratization and monitoring.

A Data Platform that supports a mission to defeat inactivity around the world

Pipeline automation makes it seamless for engineers to implement data collection without having to worry about all the moving components. In a cascade of robust technology, data should flow automatically from user interactions to places where it can be further refined, structured and controlled.

A data democratization mindset creates demand for various technologies that give transparent access to the collected and processed data, applying governance and control to protect our users, while optimizing query performance and providing tools to create better analyses and visualizations.

Monitoring comes as the last piece of our plan, as a means to ensure continued quality over time. We should keep our platform’s health in sight, with alerts, self-healing components, and outage protocols.

When we unite all of those, we end up with a very DataOps mindset, which surely takes a lot from the DevOps manifesto, and it is reflected in a lot of behaviors that ultimately fire up a healthy Data Culture. In the end, it is also very important to pass on all those values to the rest of the company, steadily spreading culture through examples and aligned communication, with people in leadership positions acting as bastions of this collaborative workflow.

We started with a more strategic view of our data structure precisely to convey that all this engineering exists to serve a well-thought-out need, so that we can understand what we are really going for. Now let’s briefly see how those values are reflected in our current architecture with this simplified overview:

Brief overview of Gympass’ current, constantly evolving, Data Platform

Events, transactional DBs and third-party partners are the source of most of our available data. This data is generated by interactions with our product, in multiple engineered instances. At this point, our data collection begins. Data is automatically ingested into our S3 Data Lake, either by batch jobs, which rely heavily on Airflow and Spark, or by events from one of our message brokers, currently RabbitMQ and Kinesis, with Flink’s aid. Those processes save data in partitioned buckets for easier consumption, using formats like Parquet. An automated pipeline creates a transparent flow between data sources and our Data Lake, which is the center stage of our orchestra.
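To make the idea of partitioned storage a bit more concrete, here is a minimal sketch of how an ingestion job might derive the object key for a Parquet file. It assumes a Hive-style `year=/month=/day=` layout partitioned by event time in UTC. The function name `partitioned_key`, the `checkins` source name and the file name are hypothetical, chosen only for illustration, and are not taken from the actual Gympass pipeline.

```python
from datetime import datetime, timedelta, timezone


def partitioned_key(source: str, event_time: datetime, file_name: str) -> str:
    """Build a Hive-style partitioned object key for one data file.

    Assumption: partitions are derived from the event time in UTC, so the
    timestamp must be timezone-aware (a naive datetime would be read as
    local time by astimezone).
    """
    ts = event_time.astimezone(timezone.utc)
    return (
        f"{source}/"
        f"year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}/"
        f"{file_name}"
    )


if __name__ == "__main__":
    # An event recorded at 22:30 in a UTC-3 timezone lands in the next UTC day.
    local = timezone(timedelta(hours=-3))
    when = datetime(2019, 8, 26, 22, 30, tzinfo=local)
    print(partitioned_key("checkins", when, "part-0000.parquet"))
    # prints: checkins/year=2019/month=08/day=27/part-0000.parquet
```

With keys shaped like this, a query engine that understands the layout can skip whole days of data when a query filters on the partition columns, which is one reason partitioning makes consumption easier.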

After data is gathered in our Data Lake, we have some layers that provide controlled access to our data, implementing governance, protection of sensitive data, and more intuitive data models.

Using Airflow, Spark, Hive, and other technologies, we transform and load data into structured stages, refining data and adding modeling layers to feed into the Data Lake itself, our Data Warehouse or other applications. To create a robust pipeline, most of our data transformations can be replayed, allowing us to accommodate changes or recreate datasets.
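A replayable transformation is usually one that is deterministic and overwrites its output instead of appending to it, so running it twice for the same partition leaves the same result. The sketch below illustrates that property with plain JSON Lines files on local disk as a stand-in for Spark jobs and Parquet files in the Data Lake. The record fields (`user_id`, `visits`), the file names and the functions `transform` and `run_partition` are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def transform(record: dict) -> dict:
    """Pure function: the same raw record always yields the same refined row."""
    return {"user_id": record["user_id"], "checkins": len(record["visits"])}


def run_partition(raw_dir: Path, refined_dir: Path, day: str) -> Path:
    """Rebuild one refined partition from its raw partition.

    The output file is replaced, never appended to, so replaying the same
    day after a fix or a schema change is safe.
    """
    source = raw_dir / f"day={day}" / "events.jsonl"
    target_dir = refined_dir / f"day={day}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "summary.jsonl"

    with source.open() as src:
        rows = [transform(json.loads(line)) for line in src if line.strip()]

    # Write to a temporary file first, then swap it into place in one step,
    # so readers never see a half-written partition.
    fd, tmp_name = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        for row in rows:
            tmp.write(json.dumps(row, sort_keys=True) + "\n")
    os.replace(tmp_name, target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        raw = Path(root) / "raw" / "day=2019-08-26"
        raw.mkdir(parents=True)
        (raw / "events.jsonl").write_text('{"user_id": 1, "visits": ["a", "b"]}\n')

        first = run_partition(Path(root) / "raw", Path(root) / "refined", "2019-08-26").read_text()
        second = run_partition(Path(root) / "raw", Path(root) / "refined", "2019-08-26").read_text()
        print(first == second)  # prints: True
```

The design choice worth noting is that the unit of replay is a whole partition: recreating a dataset then means looping `run_partition` over the affected days.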

Data in our Data Lake can be accessed using technologies like Presto, which is currently hosted in a Kubernetes cluster and scales according to demand. It gives access across the available schemas, which provides flexibility for exploration and performance for querying. On top of that, frequently queried information is usually transformed into more accessible datasets by ETLs, driven by a feedback loop with the teams accessing these data sources.

We also have staging and production environments that allow us to test changes before deployment.

This Data Platform is built and kept running using tools that allow us to evolve and maintain it productively. We are constantly planning scalable solutions for our problems, whether we are talking about infrastructure or project evolution.

Scalable tools we use at Gympass for a productive and maintainable pipeline

So far, we have given an overview of our Data Platform at Gympass, gliding through our values, beliefs and the efforts put into reaching our goals. In the next articles about our Data Platform, we will go into more detail about each step of our evolving platform, the lessons we learned during our journey, the challenges we had to face and the technology needed to overcome them.

Written by Lucas Garcia