Building a Data Platform from scratch

Rodel van Rooijen
10 min read · May 28, 2024


Solvimon's bespoke analytics experience

For businesses around the world, pricing strategies have consistently played a pivotal role in facilitating growth. Traditionally, companies relied on fixed pricing models, taking into account factors like production costs, market competition, and perceived customer value. In recent years, however, pricing has undergone a significant transformation that is reshaping the business world as we know it. The market is embracing usage- and subscription-based pricing to allow for flexibility, steady recurring revenue, and alignment with customer preferences. Usage-based pricing opens up a plethora of possibilities, such as offering customers tailored pricing that reflects the real value they get from the product. With this shift, businesses have to track usage behavior, and in doing so they gain insights into product usage that can enhance profit and drive growth. None of this can be realized without large amounts of data, which is the cornerstone of usage-based pricing.

At Solvimon, we are at the forefront of this revolution, developing a cutting-edge billing and pricing solution tailored to meet the demands of this new era. Our system encompasses the entire pricing process, from generating quotes to securing payments. In this blog post, I, as the founding data engineer, will share my experience of creating a data platform from the ground up.

As we embark on a journey through the intricacies of building a data platform, we will explore the challenging decisions, essential components, and technologies that transform raw data into data products, and highlight how such a platform can be a game-changer in today’s data-driven world.

The power of starting simple

As a company in its initial stages, it is crucial to deliver value quickly. Therefore, it is important to pick the right first data use case to tackle. Here’s the catch: start simple, but build a foundation that can accommodate future use cases. In the data world, it can be tempting to immediately delve into the realms of advanced analytics, machine learning, or even generative artificial intelligence. However, especially when starting off, such complexity is often unnecessary. It can lead to overcomplication or, worse, end up delivering no tangible customer value. The fact is that a reliable data foundation is necessary before going after the more complex data use cases.

Our first data use case? Providing clients with descriptive analytics. The first step was to define what this should achieve:

  • Value: Metrics and visualizations that provide immediate insight, e.g. into invoicing, usage, and customer information.
  • Embedded: Direct integration into our front-end. This minimizes handoffs between different functions (data, back-end, front-end), which frees up capacity for other feature development.
  • Multi-tenancy: A single source of truth, while ensuring customers can only see their own data.
  • Flexibility: Customers can slice and dice the data according to their own needs.

The fundamentals

Before I began building our data platform at Solvimon, our core applications were already taking shape, deployed within the Google Cloud Platform (GCP). Our back-end technology stack is built using Kotlin, with data streaming through Kafka (event data), and data stored in a PostgreSQL database cluster (configuration and summarized data).

While I could have used the existing stack to achieve the use case, I aimed to build a foundation that could scale, provide immediate value, and also facilitate future data use cases. Achieving this required addressing several key questions:

  • Storage: Where and how should data be stored?
  • Transformation: How can high-volume data be transformed effectively?
  • Orchestration: How will data jobs be orchestrated?
  • Freshness: How is data punctuality ensured?
  • Deployment: How is code managed and deployed?
  • Costs: How are the costs associated with data infrastructure controlled?

Embracing Google Managed Services

I have experience working at companies that host their data platforms on-premise, primarily using open-source technologies. Setting up a data platform in the cloud is a different scenario, though. While hosting a data platform on-premise comes with greater control, it also comes with many additional infrastructure complexities. GCP’s managed services streamline these complexities, but they do come at a cost. Nevertheless, these services make it possible to move swiftly with limited resources.

Leveraging my experience with Spark and Airflow, I suggested the following choices, which we adopted as a team:

  • Python as the primary programming language. Even if you are not familiar with the data world, there are many reasons to pick Python over other languages. I would mainly highlight the extensive ecosystem for data science and data engineering and the great open-source community powering it, making Python the de facto language for data.
  • Cloud Dataproc to manage data transformations. This is Google’s managed Hadoop service, which offers the ability to run serverless Spark jobs both for batch workloads and interactive workflows.
  • Cloud Composer for job orchestration. Cloud Composer is Google’s managed Airflow service.
  • BigQuery deserves special mention for its dual compute (querying) and warehousing capabilities. When budgeted wisely, BigQuery competes with open-source alternatives in capability.
  • Looker for both embedded analytics and business intelligence.

We made the conscious decision to initially limit data processing to daily and hourly batches. At this stage the focus was on balancing data freshness and cost effectiveness; by running our data pipelines at a set frequency we keep full control over this trade-off. Furthermore, our analytics use cases typically look at longer historical windows, for which hourly updates are sufficient.
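
To make the orchestration side concrete, below is a minimal sketch of what an hourly batch DAG could look like in Airflow, assuming a recent Airflow 2.x with the Google provider package installed. The project, region, bucket, and job names are placeholders, and this is illustrative rather than our actual pipeline code.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocCreateBatchOperator

PROJECT_ID = "example-project"  # placeholder, not a real project
REGION = "europe-west4"         # placeholder region

with DAG(
    dag_id="hourly_analytics_batch",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",  # the batch frequency is where freshness meets cost
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
):
    # Submit a serverless PySpark batch on Dataproc; the job itself would read
    # source tables from BigQuery and write aggregate tables back.
    DataprocCreateBatchOperator(
        task_id="build_dashboard_aggregates",
        project_id=PROJECT_ID,
        region=REGION,
        batch={
            "pyspark_batch": {
                "main_python_file_uri": "gs://example-bucket/jobs/dashboard_aggregates.py"
            }
        },
    )
```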

Some points to consider:

  • Batch versus streaming: If desired, the serverless architecture can be swapped out for dedicated Hadoop clusters using Dataproc. When combined with Flink or Spark streaming, this setup could facilitate near real-time use cases.
  • Distributed data processing: If the amount of data is limited, distributed data processing tools like Cloud Dataproc/Spark might be overkill. For example, spinning up a serverless job takes around one minute, while a small transformation might take less than a second. In this case, there is the option to run these transformations directly using BigQuery SQL (a sketch follows this list). The reason not to run all transformations directly on BigQuery is a combination of cost (when dealing with large tables, costs are significantly lower when transforming them with Spark) and functionality (PySpark provides an extensive toolbox for even the most difficult transformations).
  • Flexible core components: Each of these core components can be replaced with alternatives in the future. There were numerous alternatives considered but not selected; examples include Cloud Dataflow (Beam) for transformations, dbt for orchestration, and Trino for querying. Retaining flexibility is one of the reasons not to go for an offering that could run the entire data stack, e.g. Snowflake or Databricks.
  • Future use cases: This data stack offers flexibility and scalability, making it suitable for a wide range of use cases. For example, serverless jobs can be used to train machine learning models in the future.
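
As an illustration of the small-transformation path mentioned above, the snippet below runs an aggregation directly in BigQuery from Python; the project, dataset, and table names are made up for the example.

```python
from google.cloud import bigquery

client = bigquery.Client(project="example-project")  # placeholder project

# For a small table, a single SQL statement is faster and cheaper than the
# roughly one-minute start-up of a serverless Spark batch.
client.query(
    """
    CREATE OR REPLACE TABLE analytics.daily_invoice_totals AS
    SELECT
      customer_id,
      DATE(invoiced_at) AS invoice_date,
      SUM(amount) AS total_amount
    FROM raw_events.invoices
    GROUP BY customer_id, invoice_date
    """
).result()  # blocks until the query job finishes
```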

Reconsiderations

During the initial implementation, I discovered that some of these choices did not work out as expected.

Just deploying Airflow
In the end, I chose not to use Cloud Composer due to challenges with setting it up and its cost. I faced several issues when trying to set up Cloud Composer, primarily related to running it in a Virtual Private Cloud network. In addition, Cloud Composer comes with a substantial monthly cost. Combined, these factors led me to deploy Airflow myself, which offers greater control and cost effectiveness for our needs. For perspective, a minimal self-managed Airflow setup runs at roughly 3% of the cost of the smallest Cloud Composer environment, while giving us full control over its deployment. Managing it ourselves comes with additional complexity, but with GCP’s extensive tooling for setting up Kubernetes clusters and virtual machines, this is not difficult.

Leveraging Datastream
While the initial idea was to batch ingest configuration and summary data into BigQuery and read event data directly from Kafka, an alternative was chosen. Given that event data is also available in our database, Google Datastream offered us the ability to stream near real-time changes from our database clusters directly into BigQuery. An added benefit is that Datastream automatically handles schema evolution, keeping table schemas in sync between our databases and BigQuery. Moreover, the combination of Datastream and BigQuery guarantees that data cannot be stale for longer than a configurable amount of time, which we set to 15 minutes.
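
For illustration, such a staleness bound can be expressed as a table option on a Datastream-replicated BigQuery table; the snippet below is a sketch with a placeholder project and table name.

```python
from google.cloud import bigquery

client = bigquery.Client(project="example-project")  # placeholder project

# Allow BigQuery to serve results that are at most 15 minutes behind the
# source database, matching the freshness bound described above.
client.query(
    """
    ALTER TABLE replicated.pricing_events
    SET OPTIONS (max_staleness = INTERVAL 15 MINUTE)
    """
).result()
```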

Metabase over Looker
Initially, Looker looked like the most promising tool to power our embedded analytics and business intelligence. It is the standard offering in GCP and has an extensive list of functionality. However, we quickly realized that Looker comes with a substantial minimum price tag and a price per dashboard viewer. This makes it a less attractive option, as our platform typically has an arbitrary number of users viewing dashboards. Instead, we went back to the drawing board and compared several other options. In the end, Metabase stood out because of its query speed, customizability, it being open source, and its ability to offer a data sandbox.

Journey of a commit

An integral component of a data platform involves effective code management and the ability to interactively work with data. At Solvimon, we use GitLab to orchestrate our CI/CD pipelines. This gives us continuous deployment in our development environment and lets us deploy to our sandbox whenever desired. Only after successfully navigating these initial stages do we deploy to our live production environment. Both our sandbox and live environments are exposed to our customers.

In addition to our deployment capabilities, enabling interactive workflows is an important part of our data platform. A foundation for such workflows is built using JupyterLab, where our team can run and configure notebooks on their local machines. These notebooks can seamlessly connect to customizable Spark kernels that interface with data in BigQuery. This approach grants our engineers complete control over their development process, ensuring they can conveniently access cloud resources whenever needed. Data access is controlled on a per-environment and per-dataset basis, with granular roles that restrict access to the necessary subset of data. Typically our engineers work only with data in our development environment.
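
As an example of such an interactive workflow, a notebook session could look like the sketch below: a local Spark session pulling in the spark-bigquery connector and reading a development dataset. The connector version, project, and table names are placeholders rather than our actual configuration.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("exploration")
    # placeholder connector version; pick one matching your Spark/Scala version
    .config(
        "spark.jars.packages",
        "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1",
    )
    .getOrCreate()
)

# Read a table from the development environment's BigQuery dataset.
usage = (
    spark.read.format("bigquery")
    .option("table", "example-project.dev_analytics.usage_events")
    .load()
)
usage.groupBy("account_id").count().show()
```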

Choosing the right data model

Data modeling is a topic that often does not receive the attention it deserves. To fully harness the potential of a data technology stack, it is essential to carefully structure the data sources. Data modeling involves creating a conceptual representation of data, which guides us in determining the appropriate data structures and relationships. In our case, the starting point of the data model is the relational database, specifically Postgres, which defines the base dimension tables and fact tables. The dimension tables contain configuration data (accounts, users, pricing configurations, etc.), while the fact tables are the event and summary tables (ingestion events, pricing summaries, etc.). This was treated as a given.

For our data platform, I chose to combine two conceptual data models:

  • One Big Table schema: Also known as the flat or denormalized table model, this is a data modeling approach where all data is stored in a single table. In this model, there is minimal to no attempt to normalize data or create separate tables for related entities or attributes. Instead, all information is combined into a single, wide table structure.
  • Snowflake schema: Not to be confused with Snowflake the tool. In this data model, fact tables play a central role and are linked with multiple dimension tables. This data model is close to how our relational database is shaped.

The base tables, i.e. the data that comes from the relational database, follow the Snowflake schema. For business intelligence purposes it is in most cases enough to create materialized views on top of these tables. For the data visualizations, dashboards, and reports in our application, we chose an approach that resembles the One Big Table schema: denormalized aggregate tables are created that contain all relevant information for a single dashboard or report. This significantly speeds up query execution and allows us to preprocess information from complex data types (e.g. JSON or arrays) up front.
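
To make the One Big Table side concrete, the sketch below shows how a denormalized aggregate table for a single dashboard could be built from a fact table and a dimension table with PySpark. All table and column names are hypothetical, and the direct write option is an assumption rather than a description of our jobs.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dashboard_aggregates").getOrCreate()

# Fact table (invoices) and dimension table (accounts) from the base layer.
invoices = spark.read.format("bigquery").option("table", "example-project.base.invoice_facts").load()
accounts = spark.read.format("bigquery").option("table", "example-project.base.account_dim").load()

# Join facts with dimensions and pre-aggregate, so the dashboard only ever
# queries one wide table.
dashboard = (
    invoices.join(accounts, "account_id")
    .groupBy("account_id", "account_name", F.to_date("invoiced_at").alias("invoice_date"))
    .agg(
        F.sum("amount").alias("total_invoiced"),
        F.countDistinct("invoice_id").alias("invoice_count"),
    )
)

(
    dashboard.write.format("bigquery")
    .option("table", "example-project.marts.invoicing_dashboard")
    .option("writeMethod", "direct")  # assumption: direct write, no staging bucket
    .mode("overwrite")
    .save()
)
```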

In summary, data modeling is a crucial step in shaping our data sources to maximize the capabilities of our data technology stack. The choice to combine the two data models aligns well with our business requirements, offering both performance benefits and enhanced manageability.

Our first data product

The previous sections outlined our data platform’s design, technology selection, and workflow management. How does all of this benefit us?

In essence, our data platform offers us the capability to convert high-volume data into normalized and consolidated data sources through a robust deployment process. The subsequent phase was to integrate this with our front-end application to facilitate embedded analytics. To accomplish this, Metabase’s embedded dashboard functionality allows us to directly incorporate interactive dashboards and visualizations into our user interface. Multi-tenancy and data security are ensured by implementing user-specific access tokens, which restrict data access. Furthermore, dashboards are made interactive by embedding links into our tables and visualizations that automatically adjust filters. This enables customers to slice and dice data according to their needs.
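
As an illustration of how this kind of embedding works, Metabase’s signed embedding takes a JWT generated server-side per user; the sketch below uses a placeholder site URL, secret, dashboard id, and parameter name rather than our actual setup.

```python
import time

import jwt  # PyJWT

METABASE_SITE_URL = "https://metabase.example.com"            # placeholder
METABASE_SECRET_KEY = "embedding-secret-from-metabase-admin"  # placeholder


def embedded_dashboard_url(dashboard_id: int, account_id: str) -> str:
    payload = {
        "resource": {"dashboard": dashboard_id},
        # Locked parameters are enforced by Metabase, so a customer can only
        # ever see rows belonging to their own account.
        "params": {"account_id": account_id},
        "exp": round(time.time()) + 600,  # token valid for 10 minutes
    }
    token = jwt.encode(payload, METABASE_SECRET_KEY, algorithm="HS256")
    return f"{METABASE_SITE_URL}/embed/dashboard/{token}#bordered=false&titled=false"
```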

Fast forward to today: we have launched several dashboards which power our Insights product. The intention is to expand this use case by using our data stack to power visualizations in various sections of our application.

Conclusion

I have guided you through our journey of constructing a data platform from the ground up. However, in our case this is just the beginning. Although we deliberately opted for a simple use case, the groundwork laid down is scalable, flexible, and capable of accommodating a wide range of future use cases. This includes, but is not limited to, generating (smart) alerts, providing actionable pricing recommendations, forecasting revenue and margins, and powering our entire reporting solution.

I hope you enjoyed this blog! This is a repost of the original blog on the solvimon.com website.
