Hi from Data Management team at Eureka 👋
If you’ve been in the data field, even for a short while, then you know how complicated it is and how fast it evolves. New problems and solutions emerge every year, and everyone talks about the modern data stack that organizations can implement to solve their data problems.
There are many good articles about the modern data stack on the web already, so in this post I will skip straight to our take on implementing the modern data stack at Eureka. Note: this project is still in development, but we plan to go live very soon.
One of the major goals of the work described in this post is to simplify all data-related operations at Eureka, so “simple” and “easy” will be the words of the day.
Current state of the data platform at Eureka
Let’s take a look at an approximate representation of our current data platform. The application environment (including most of the data sources) is hosted on Amazon Web Services (AWS), and our data platform is built predominantly on Google Cloud Platform (GCP).
This is a very simplified version of our data platform. For example, the AI platform is generalized as one “AI platform” block, and all minor data sources are put into the “Other sources” block.
However, just by looking at this simplified diagram, we can define several major problems with this architecture:
- Data (sometimes the same data) is scattered across several places, which creates data silos and data duplication.
- Different data ingestion tools for different data sources. Most of these tools are self-hosted, which adds a huge maintenance burden. We are not just dealing with maintaining code — we are dealing with code written in different languages (currently mostly a mix of Python, SQL, Java, Go, Ruby, HCL and Shell scripts).
- Data from the operational MySQL database is both ingested into the data platform and accessed directly for analysis.
- Data from Appsflyer is not ingested directly into the data platform; instead, it goes back to the application backend and is processed there before being sent to the data platform.
Plus, there are other problems that are not depicted in the diagram but should be addressed:
- The current data platform uses the same operational environment as applications.
- The data platform configuration is scattered across 10+ GitHub repositories, plus some manual configuration which is not defined anywhere.
- There is no data catalog that we can use to track what data we have and where it is. It often takes a lot of time to find and get the required data for analysis.
- While we have simple data quality monitoring in place, we can still improve our tooling and approach to guarantee higher data quality for analysis.
- Although we have proper data access controls to guarantee privacy and security of our data, there are new, simpler approaches to data access management that we can implement.
- With this architecture and the above-mentioned points, managing the data lifecycle becomes a complex (and financially costly) problem.
For a while, there was no dedicated person to manage this data platform. Big thanks to our BI and SRE teams who handled data platform management during that time.
In February 2021, Eureka established the new Data Management team to control this complex operation and improve the data platform overall (you can read more about that decision here (JP)).
Very quickly, we realized that instead of just fixing existing problems, we could create a new data platform based on tooling from the modern data stack to solve those problems at the root and simplify our whole data workflow.
New data platform
There were two important goals we wanted to achieve with a new data platform:
- Automate our data platform as much as possible while keeping the architecture simple and transparent.
- Increase overall understanding of the importance of a simple and reliable data platform inside Eureka.
I believe both of these topics are very important for a modern data platform, but are not talked about enough in the data community. I plan to write a separate post about them later, but for now, let me briefly mention how we approached these goals at Eureka.
Data Platform as Code
If we look at the logical layer of a data platform, we can see that it consists of a bunch of different technical resources such as databases, tables, pipelines, buckets, permissions etc. It’s very similar to an application infrastructure.
Often these resources are managed by different teams, and sometimes with a lot of manual work. This makes it hard to maintain strictly defined dependencies between resources, leads to many potential problems, and undermines the transparency of the whole architecture.
What if we could just take all data platform resources, define them in one place, and let some software provision and manage these resources for us?
Fortunately, this approach already exists: it’s called Infrastructure as Code, and mature software solutions for it are available. We at Eureka use Terraform.
Terraform allows us to define our data platform resources with code and automate resource management. This maintains strict dependencies between all resources, and makes the architecture transparent, easily understandable, auditable, and much more secure. It also allows us to define access control policies on a very granular level (such as columns in a table) and easily see who can access what data.
Yes, this might sound tedious when you consider creating configuration for every small piece of a data platform, such as data warehouse tables and individual user permissions. But once you get used to this flow, it not only simplifies everyone’s work and protects against many potential problems, it also brings immense peace of mind: you know that the platform looks and works exactly the way you defined it.
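To make this concrete, here is a minimal, hypothetical sketch of what defining data platform resources in Terraform can look like. The dataset, table, taxonomy, region, and group address are all illustrative, not our actual configuration:

```hcl
# A BigQuery dataset and a policy-tag taxonomy, defined as code.
resource "google_bigquery_dataset" "analytics" {
  dataset_id = "analytics"
  location   = "asia-northeast1"
}

resource "google_data_catalog_taxonomy" "pii" {
  display_name           = "pii"
  region                 = "asia-northeast1"
  activated_policy_types = ["FINE_GRAINED_ACCESS_CONTROL"]
}

resource "google_data_catalog_policy_tag" "email" {
  taxonomy     = google_data_catalog_taxonomy.pii.id
  display_name = "email"
}

resource "google_bigquery_table" "users" {
  dataset_id = google_bigquery_dataset.analytics.dataset_id
  table_id   = "users"

  schema = jsonencode([
    { name = "user_id", type = "INTEGER", mode = "REQUIRED" },
    {
      name = "email", type = "STRING", mode = "NULLABLE",
      # Column-level access control: only principals granted the
      # Fine-Grained Reader role on this policy tag can read the column.
      policyTags = { names = [google_data_catalog_policy_tag.email.id] }
    },
  ])
}

# A dataset-level permission, reviewable in one place via code review.
resource "google_bigquery_dataset_iam_member" "analyst_reader" {
  dataset_id = google_bigquery_dataset.analytics.dataset_id
  role       = "roles/bigquery.dataViewer"
  member     = "group:analysts@example.com"
}
```

Because every table, policy tag, and permission is a Terraform resource, any change to who can see what goes through a pull request and leaves an audit trail.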
Note that this solution works pretty well for us since our data platform is cloud-native, but I admit, it won’t cover all possible scenarios and architectures. It might not be the most viable option for some organizations, especially enterprises with on-premise resources.
Data Platform as a Product
If we take a look at any application or service (like Pairs), we can see that most of them are positioned as products. What are the main characteristics of an application product? It has:
- Target audience
- Clear purpose
- Functionality
Thanks to these characteristics, users can recognize an application, and understand what the application does and why they might need it.
We know that a data platform has a clear purpose, functionality, and internal target audience. It also consists of various technical resources and code, just like an application. So why don’t we treat a data platform as a product too?
If you treat your data platform as a technical product and internally advertise it as such, you help everyone in your organization understand its purpose and importance more clearly. In the end, almost everyone in an organization is a data platform user in one way or another, right?
I strongly believe that this concept is extremely important for the success of any serious organizational data initiatives. However, it seems like this notion is still rather new as there are very few resources on the Internet about it. And to my knowledge, only a limited number of companies have come close to putting this concept into practice, such as Airbnb (Minerva) and LinkedIn (Dali). Please let me know if there are any other examples!
Following this idea, we created our own internal data platform product and called it Metis.
Eureka often uses ancient Greek names and words internally (like Eureka itself!), so we named our new data platform after Metis, the goddess of wisdom and deep thought in ancient Greek mythology.
Metis architecture and benefits
Let’s take a look at the representation of Metis architecture.
As you can see, Metis is not just a data catalog or a name for a data warehouse. Metis is a complete data platform fueled by a combination of various data tools and code. Every piece in this architecture has an explicit purpose and use-cases that do not overlap with other elements of the platform.
This architecture provides us with numerous benefits:
- Simple data ingestion: there is only one ingestion tool, and it is fully managed (we decided to go with the popular option, Fivetran). We don’t need to manage multiple ingestion tools, and the amount of code for the ingestion part of Metis has been reduced from several repositories in different languages to just one repository with Terraform configuration (we still have a little bit of mandatory code on the sources side before Fivetran, but it is insignificant compared to the previous data platform).
- Simple data transformation: we will use dbt with its SQL-only approach for all data transformation work. There are really no better or similar alternatives available at the moment (though I’m personally looking forward to hearing more news from Dataform).
- Data catalog: to make it absolutely clear what data we have and where it is, we will implement a Data Catalog and register all our data resources there. Users will be able to use a simple Data Catalog GUI to search for any data across all our data sources in one place, which will save them a lot of time and effort. Obviously, it will still require us to put on a data steward hat to manually catalog everything and keep it synchronized, but who said that data management was easy?
- Comprehensive data management and data governance: by fully utilizing a data catalog, policy tags and Terraform definitions, we not only have full control over all data in our data platform, but we can also create a single source of truth for data in our organization.
- Improved security, robustness and transparency: architecture, pipelines, permissions, and policy tags are all put into code with automated provisioning. Everything is transparent and can easily be audited via GitHub and logs. We know exactly who has access to what data. Also, we are moving the data platform into a separate cloud environment.
- Overall simplicity: we are getting rid of a lot of code and technical tools, making our teams’ work much easier and increasing our users’ trust in data. This also simplifies our language stack to just HCL (Terraform), SQL + YAML (dbt), and occasionally Shell/Python (for Composer jobs that cannot be implemented in dbt).
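To give a feel for the dbt side of this stack, here is a minimal, hypothetical model sketch (the source, table, and column names are illustrative, not from our actual project). A dbt model is just a SELECT statement; dbt materializes it and infers the dependency graph from `ref()`/`source()` calls:

```sql
-- models/marts/daily_active_users.sql
-- Assumes a source named 'app' with a table 'events' is declared
-- in a sources YAML file elsewhere in the dbt project.
{{ config(materialized = 'table') }}

select
    date(event_timestamp)   as activity_date,
    count(distinct user_id) as daily_active_users
from {{ source('app', 'events') }}
where event_name = 'session_start'
group by 1
```

Transformations expressed this way live in version control next to their tests and documentation, which is exactly what lets a SQL-only approach replace several repositories of pipeline code.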
As a result, we have a much more reliable and, at the same time, much simpler data platform.
It’s clear that this architecture is completely batch-oriented. This is done on purpose to keep Metis simple, since the major focus is on business intelligence use-cases. However, at Eureka, we are also utilizing streaming pipelines where a batch approach is not enough, for example as part of the AI platform.
Moreover, positioning our data platform as a product greatly helps us increase the platform’s visibility and users’ understanding of it. No matter what team or level an employee belongs to, they can clearly recognize Metis as an analytics platform, organizational data hub, company-wide data sink, operational data source, or whatever image is most relevant to their kind of work.
We have only recently started to advertise our new data platform as the Metis product internally, but we are already receiving positive feedback from people at all levels of the company, who report that they now understand the data platform’s purpose and importance much better.
This is only the beginning — we are still building the initial version of Metis, and there is a very long road of continuous improvements and new features ahead. There are a lot of things planned for Metis:
- Improved data lifecycle management.
- Compliance with the Amended Act on the Protection of Personal Information in Japan, which goes into full effect in April 2022 (EN/JP).
- Thorough data testing & observability, and data anomaly detection.
- Improved data catalog (automatic metadata inference for all data sources, i.e., data discovery).
- Better integration with the AI platform.
- Renovation of business intelligence tool stack.
- Leveraging machine learning for business intelligence.
And much more! 2022 is going to be a hot year for data modernization at Eureka.
If you have any questions or would like to discuss any of the topics in this post, feel free to start a discussion in the comments or reach out directly!