Our journey towards an open data platform

Doron Porat
Yotpo Engineering
8 min read · Oct 10, 2021


In this post, I’m going to take you through our journey of shaping Yotpo’s data platform architecture.

Navigating the flooded data technologies market can be confusing at times. We find ourselves mixing managed, open-source and self-developed solutions to build a balanced stack. There are so many decisions to make along the way, all made under one clear principle: keeping our data platform as open as possible.

If all this sounds familiar or intriguing and you want to learn from our data experiences and challenges — I encourage you to spend the next few minutes with me.


Yotpo’s data platform

A data platform is a full ecosystem surrounding all aspects of an organization’s data. Data Platform Engineering is the craft of aligning different services, technologies and methodologies, designed to ingest, transform, maintain, enhance, and serve data inside and outside the organization.

Different data platforms have their similarities, but each platform holds a unique approach to data, to best reflect the organization’s needs and culture.

Here’s a general overview of our data platform.

I’ve gathered a few interesting anecdotes from the current architecture to demonstrate the composition we’ve built and the events that helped shape it:

Query engines

We’ve pretty much worked with Spark from the get-go. We use our own open-source ETL framework, called Metorikku, to orchestrate our Spark jobs. Metorikku lets generalist developers write Spark pipelines in plain SQL, and it also supports Scala UDFs for describing and testing complex transformations. Together, Metorikku and Spark opened the door to big data processing and introduced advanced analytics at Yotpo. Spark Structured Streaming brought real-time processing of big data to enrich our products and fire up the imaginations of our Product Managers.
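To give a feel for what this enables, here’s a minimal sketch of the underlying idea (not Metorikku’s actual API): the transformation stays in plain Spark SQL, while the tricky logic lives in a small, testable Scala UDF. The table, paths and the UDF itself are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object SqlPipelineSketch {
  // Hypothetical UDF: normalize a free-text country field.
  val normalizeCountry: String => String = raw =>
    Option(raw).map(_.trim.toUpperCase).getOrElse("UNKNOWN")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sql-pipeline-sketch").getOrCreate()

    // Register the Scala UDF so it can be called from SQL.
    spark.udf.register("normalize_country", normalizeCountry)

    // Hypothetical input table exposed as a view.
    spark.read.parquet("s3://bucket/raw/orders/").createOrReplaceTempView("orders")

    // The metric itself is plain SQL, which is what lets generalists contribute.
    val ordersByCountry = spark.sql(
      """SELECT normalize_country(country) AS country, COUNT(*) AS orders
        |FROM orders
        |GROUP BY 1""".stripMargin)

    ordersByCountry.write.mode("overwrite").parquet("s3://bucket/metrics/orders_by_country/")
  }
}
```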

We use Databricks as our ad-hoc interface for all data lake consumers to query data. The collaborative web notebooks have increased adoption of the data platform among anyone equipped with basic SQL knowledge. We also run Looker dashboards on top of Databricks! In the case of customer-facing dashboards, we use Snowflake as an acceleration engine, to provide sub-second data retrieval.

Data Ingestion

Change data capture (CDC) is a pattern by which one can track database changes in near-real-time.

We started doubling down on CDC when our R&D embarked on the mission to break our monolith into microservices. This created a reality in which joining data across different domains was no longer possible at the RDBMS level.

Our original pipeline consisted of Debezium for syncing data into Kafka with Schema Registry, and Apache Hudi to merge the change events into the data lake as easy-to-query Parquet files. Mastering Hudi was both challenging and exhausting. We moved from MOR to COW (merge-on-read and copy-on-write, in Hudi terminology), tweaking configurations to improve compute performance and latency while trying to control costs. We kept investing in Hudi, up to the point where we felt we had to either go all-in or look for an alternative that would deliver better results.
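For context, here’s a rough sketch of the kind of Hudi upsert we spent so much time tuning. It is not our production job; the table, key fields and paths are hypothetical, but the hoodie.* options shown are the standard ones behind the COW/MOR and upsert decisions described above.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

object HudiUpsertSketch {
  // Upsert a batch of change events into a Hudi table on the data lake.
  def upsertChangeEvents(changes: DataFrame): Unit =
    changes.write
      .format("hudi")
      .option("hoodie.table.name", "orders")
      .option("hoodie.datasource.write.table.type", "COPY_ON_WRITE") // we ended up here after starting with MERGE_ON_READ
      .option("hoodie.datasource.write.operation", "upsert")
      .option("hoodie.datasource.write.recordkey.field", "id")          // primary key of the source table
      .option("hoodie.datasource.write.precombine.field", "updated_at") // latest change event wins
      .option("hoodie.datasource.write.partitionpath.field", "created_date")
      .mode(SaveMode.Append)
      .save("s3://bucket/lake/orders/") // hypothetical target path
}
```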

That’s where Upsolver came in: with it, we managed to scale our CDC performance remarkably. Data latency dropped to under a minute, even for the biggest, most active tables, and the cost was reduced significantly.

To make this pipeline available to the rest of R&D, we made sure to automate and script all the moving parts using CI tools like Jenkins and Terraform, which in turn triggered some in-house CLI tooling. We managed to keep our feature developers’ cognitive load as light as possible, even as CDC became an inseparable part of our architecture.

Data governance

Early on, we started using Hive Metastore as our internal catalog. It was basic, but it gave us the ability to store metadata and offered some schema discovery for our data lake tables.
Soon after, we started handling PII data and GDPR requests, which emphasized the need for clear visibility into how and where we store personal information, as well as who can access it and why.

To our surprise, we did not find many solutions to fit our stack. The Spark + Hive + Databricks combo was the first barrier that ruled out most of the vendors we looked into. We were left with 2–3 options and started a PoC with 2 of them. Things were not easy.

Integrating those systems around Hive external tables, and configuring Databricks, with its many features and our varied use cases, to manage access through this platform, required a long process to get the wheels turning. Handling Upsolver views on top of the insert and update partitions was another obstacle, which we eventually cleared using Okera.

(Diagram paraphrased from Okera’s blog.)

While Okera serves as a Hive proxy for data retrieval requests, we built pipelines that identify catalog changes via Hive hooks and trigger auto-tagging of sensitive data on Okera’s side. Users who need PII data provide their justification through a temporary role-elevation process that is audited and logged internally. Our CISO department uses the platform to apply access policies and monitor PII access.
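To illustrate the catalog-change side of this, here is a sketch of what such a hook could look like as a Hive metastore event listener that reports new or altered tables so a downstream job can auto-tag sensitive columns. This is not our exact implementation, and the notification sink here is just a stand-in.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hive.metastore.MetaStoreEventListener
import org.apache.hadoop.hive.metastore.events.{AlterTableEvent, CreateTableEvent}
import scala.collection.JavaConverters._

// Registered on the metastore via the hive.metastore.event.listeners property.
class CatalogChangeListener(config: Configuration) extends MetaStoreEventListener(config) {

  override def onCreateTable(event: CreateTableEvent): Unit = {
    val t = event.getTable
    notifyTaggingPipeline(t.getDbName, t.getTableName, t.getSd.getCols.asScala.map(_.getName))
  }

  override def onAlterTable(event: AlterTableEvent): Unit = {
    val t = event.getNewTable
    notifyTaggingPipeline(t.getDbName, t.getTableName, t.getSd.getCols.asScala.map(_.getName))
  }

  // Stand-in sink: in practice this would publish the change to the auto-tagging job.
  private def notifyTaggingPipeline(db: String, table: String, columns: Seq[String]): Unit =
    println(s"catalog change: $db.$table columns=${columns.mkString(",")}")
}
```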

Data quality

Let me divide data quality into three main types:

  • Pre-process: unit tests that run against mock data before the pipeline executes
  • In-process: checks that run as part of the pipeline itself, against the actual data
  • Post-process: recurring scans of the data after it has been written

Data quality is closely tied to data reliability. Maintaining data reliability, and making it visible, is important to keep your platform relevant and your users engaged. There is no single method that guarantees quality over time, so all three types are required.

As part of composing a Metorikku pipeline, we cover both pre-process (unit tests) and in-process (AWSLabs Deequ) data quality checks. With these two, we cover different data scenarios and test complex UDFs using mock data and simple assertions. Still, things can go wrong outside of our test coverage, and this is where post-process data quality comes in.
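As an illustration of an in-process check, here’s a minimal Deequ sketch using its standard VerificationSuite API. The DataFrame and column names are hypothetical; in our case, checks like these are declared as part of the Metorikku pipeline rather than wired up by hand.

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.DataFrame

object OrdersQualityCheck {
  // Run basic sanity checks on an orders DataFrame and fail the run if they break.
  def verify(orders: DataFrame): Unit = {
    val result = VerificationSuite()
      .onData(orders)
      .addCheck(
        Check(CheckLevel.Error, "orders sanity")
          .isComplete("order_id")        // no null keys
          .isUnique("order_id")          // no duplicate keys
          .isNonNegative("total_amount") // amounts can't be negative
      )
      .run()

    if (result.status != CheckStatus.Success)
      sys.error(s"Data quality checks failed: ${result.checkResults.values}")
  }
}
```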

I like to call it the “holistic approach” to data quality. You can’t cover every aspect in your predetermined tests, but recurring scanning of your data after it’s written can highlight those blind spots. That’s exactly what Monte Carlo does.

We’ve known Monte Carlo for a while, having used it in other parts of the platform to track lineage between Redshift tables and Tableau dashboards. We worked together on designing and enabling the data lake integration on top of Hive and S3, to track data health and set up anomaly detection that alerts on abnormal behavior. This has proven to be a scalable and reliable way to monitor data health as pipelines expand and the data catalog grows.

Apparently, it’s called Data Meshing

We managed to build our data platform by investing in enablement infrastructure around every new technology we introduced. This approach requires extra effort around every new implementation, but it pays off when any Full-Stack Engineer, Data Scientist, BI developer or Support specialist can self-serve around it.

This also means that our small team (4–5 engineers) handles laying down the infrastructure for self-serve shipping of data into the platform, but not the actual data or the data modeling of the different business domains.

Many reasons support this paradigm, among them the team’s size and capacity, but above all it is our belief that data should be handled by those who know it most intimately; hence, “treat your data as a product”.

This has been a long and complex process of decentralizing data piping and shifting ownership onto the data domain owners. What was an organic force of reality for us was beautifully articulated as “the Data Mesh” by Zhamak Dehghani, who endorses this approach as a best practice for scale, efficiency and survivability.

Next generation

Lately, we’ve spent quite some time on deprecations and consolidations to make our architecture tighter, as one of our goals is to simplify maintenance and adoption. For the next generation of our data platform we’ve identified the following pillars of influence:

Data Analytics

We’ve come to realize that our modeling infra should be better optimized for analytics purposes:

  • Users lacking SQL knowledge are dependent on dashboards and pre-built reports
  • Missing visibility on modeled domains and business logic
  • The analytics development process is long and difficult

All these are holding us back from becoming the super data-driven organization we ought to be! To get there, modeled data should be as low-hanging as any other data, backed by a strong self-service layer. At the center of this solution we are going to place DBT as our modeling infrastructure. We’ll use it to build views or materialize them in the data lake or onto different acceleration engines. We’ll also expand our use of Looker, with DBT-based LookML models that let stakeholders self-serve via applications, reports or dashboards.

Data Observability

Our next steps in data observability should focus on making information available to further enable scale and velocity. The time has come for us to both utilize our existing observability infrastructure to the fullest and add the missing components to complete the observability stack:

  1. Tagging of datasets and data pipelines by product/business domain, quality metrics, level of cleanliness, raw vs. derived, sensitivity, and more.
  2. Redesigning the catalog databases to better reflect the organization’s structure and needs.
  3. End-to-end lineage throughout the platform (can’t wait to get into OpenLineage). This is critical both for identifying the implications of data downtime (be it data corruption or exceeding latency SLAs) and for aiding deprecation and maintenance work.
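On the lineage point, here’s a rough sketch of what wiring OpenLineage into a Spark job could look like, e.g. from a notebook or spark-shell. The listener class and the spark.openlineage.* keys come from the OpenLineage Spark integration and may differ between versions; the host URL and namespace are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Attach the OpenLineage Spark listener so the job emits lineage events per run.
val spark = SparkSession.builder()
  .appName("lineage-enabled-job")
  .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
  .config("spark.openlineage.host", "https://lineage.internal.example.com") // placeholder endpoint
  .config("spark.openlineage.namespace", "data-platform")                   // placeholder namespace
  .getOrCreate()
```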

Streaming pipelines

We have long since conquered raw data streaming. Next on our agenda is offering an alternative to cumbersome batch data pipelines. When dealing with these pipelines, we choose between managing increments and running full loads. The first option is fragile, the latter is expensive. Either way, data freshness is an issue. With streaming, the load is spread out into many small, sporadic increments, which overall can lead to cost savings. Unfortunately, this is a complex task that involves:

  • many sources
  • late arriving data
  • many triggers to map
  • stateful vs. stateless joins

After examining several alternatives, we are currently looking into different Kafka-based technologies for joining streams. Combined with event-based architecture, CDC and the outbox pattern, placing these joined streams under the relevant domain ownership, as close to the source as possible, would fit naturally into the picture. From Kafka, the new pipelines will stream into the data lake and also be available to other potential consumers.
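To make this a bit more concrete, here’s a minimal sketch of one option on the table: a stateful stream-stream join in Spark Structured Streaming, which we already run for other use cases. The topic names, schemas and paths are all hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{expr, from_json}
import org.apache.spark.sql.types._

object JoinedStreamsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("orders-payments-join").getOrCreate()
    import spark.implicits._

    val orderSchema = new StructType()
      .add("order_id", StringType)
      .add("order_ts", TimestampType)

    val paymentSchema = new StructType()
      .add("payment_order_id", StringType)
      .add("payment_ts", TimestampType)

    def readTopic(topic: String, schema: StructType, tsCol: String) =
      spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092") // hypothetical brokers
        .option("subscribe", topic)
        .load()
        .select(from_json($"value".cast("string"), schema).as("e"))
        .select("e.*")
        .withWatermark(tsCol, "30 minutes") // bounds state for late-arriving data

    val orders   = readTopic("orders", orderSchema, "order_ts")
    val payments = readTopic("payments", paymentSchema, "payment_ts")

    // Stateful join: a payment must arrive within one hour of its order event,
    // otherwise its state is dropped and the join never fires.
    val joined = orders.join(
      payments,
      expr("order_id = payment_order_id AND " +
           "payment_ts BETWEEN order_ts AND order_ts + INTERVAL 1 HOUR"))

    joined.writeStream
      .format("parquet") // in practice this lands in the data lake
      .option("path", "s3://bucket/joined_orders/")       // hypothetical path
      .option("checkpointLocation", "s3://bucket/_chk/")  // hypothetical path
      .start()
      .awaitTermination()
  }
}
```

The watermarks are what keep the join state bounded when data arrives late, which is exactly the trade-off that makes these joins more involved than their batch equivalents.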

Are we done yet?

One of our R&D managers once asked me how much work we have left, i.e., how long it will be before the team can move on to some product development.
Well fellas, Data is our product! And it ain’t gonna be done anytime soon.

I guess that’s what’s so exciting about our line of work, as we find ourselves redesigning and adapting time and again. This is one of the reasons why this modular approach is so important — to always enable growth, agility and movement.
