DataOps Self Service: Yotpo’s war on grunt work

Amir Halatzi
Published in Yotpo Engineering · Nov 21, 2022

At Yotpo, we handle many different data sets (reviews, browser analytics, campaign statistics, and more), all originating from different sources:

  • user-generated content
  • 3rd-party services, like CRM
  • application data
  • application metadata

We specialize in transforming this pool of information into valuable insights. We have various ways to do this: training machine learning models, using decision support systems, and sometimes simply putting the data back in the hands of the customer.

Because our company is so data-driven, our role as Yotpo’s Data Infrastructure Group is to improve the developer experience of working with big data.

We focus on reducing the cognitive load of adopting a hoard of new tools, democratizing access to the data, and making things as self-service as possible.

We see self-service as a force multiplier that allows us to focus on the fun stuff we engineers love, by delegating the day-to-day work back to the people who can do it best: the owners of the data.

In this post, I’d like to share some examples of how we took data infrastructure processes and made them easier to adopt.

Batch data pipelines

One of the most common tasks of data engineering is to create and maintain batch data pipelines. We use those to run all sorts of back-office flows: from aggregating data for reporting, through importing data from external services (like CRM, HR, and others), to training ML models with newly collected data.

At Yotpo, like in many other big data companies, we chose to implement them using Apache Spark as the execution engine, and Apache Airflow as the orchestrator and scheduler — all running on top of Kubernetes.

This means that in order to add data into our data mesh, a developer/analyst will have to learn three new complex tools that are not part of their day-to-day toolbox. And that’s before talking about the transformation process, or the different storage targets available.
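
To make that learning curve concrete, here is a rough sketch of the kind of boilerplate a pipeline owner would otherwise have to write by hand: an Airflow DAG that submits a Spark job. The DAG name, application path, and connection id are invented for illustration and are not Yotpo’s actual setup.

```python
# A minimal, hypothetical Airflow DAG that submits a Spark job.
# Paths, connection ids, and schedules are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="reviews_daily_aggregation",   # hypothetical pipeline name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    aggregate_reviews = SparkSubmitOperator(
        task_id="aggregate_reviews",
        application="/opt/jobs/aggregate_reviews.py",  # hypothetical PySpark job
        conn_id="spark_k8s",                           # hypothetical Spark-on-K8s connection
        conf={"spark.kubernetes.namespace": "data-pipelines"},
    )
```

And that is only the orchestration layer; the Spark job itself and its Kubernetes resources still need to be defined and tuned.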

Let’s lower the bar

Instead of pushing a whole new stack of tools onto every developer, we decided to rely on sensible defaults for everything except the bare minimum a pipeline owner needs to specify:

  • The data source(s)
  • The data destination(s)
  • The transformation to apply, defined in SQL
  • When to run the pipeline
  • How wide the pipe is, in terms of units of scale

All that’s needed is a couple of YAML files to express these settings, and our custom wrappers take care of the rest. Once we had the basics covered, we improved the integration with Yotpo’s stack, throwing Slack, Grafana, Prometheus, and even cost monitoring into the mix.
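
For illustration, a pipeline definition in this style might look something like the sketch below. The field names and values are invented to show the shape of the idea rather than our actual configuration schema.

```yaml
# Hypothetical pipeline definition; field names are illustrative, not Yotpo's actual schema.
pipeline: reviews_daily_aggregation
schedule: "0 3 * * *"        # when to run
scale: medium                # how wide the pipe is, in units of scale
sources:
  - table: mysql.reviews
destinations:
  - table: lake.reviews_daily
transformation: |
  SELECT store_id,
         DATE(created_at) AS day,
         COUNT(*)         AS reviews
  FROM reviews
  GROUP BY store_id, DATE(created_at)
```

Everything not spelled out here (cluster defaults, retries, alerting, dashboards) is filled in by the wrappers.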

Now, instead of our group fixing broken pipelines, we can focus on improving utilization and extending our infrastructure capabilities (and there are always new requirements :)

Here’s what it looks like:

Diagram showing the connection between Spark, Airflow and Kubernetes
Yotpo’s Batch Process Structure

Change data capture processes

At Yotpo, our data store of choice is mainly MySQL. While that’s all fine and dandy for running the application, it’s a whole different ball game when it comes to analytics.

This gave rise to a requirement to sync our application MySQL databases (and other databases) to the data lake, where the data can be consumed and analyzed without affecting the user experience or application performance. Once again we chose a commonly used solution: we use Debezium to convert the database’s binary log into Kafka events, which in turn are persisted to the data lake, providing a near-real-time replica of the application’s data.
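
As a rough sketch of the Debezium side, and assuming a standard Kafka Connect deployment, registering a MySQL connector boils down to posting a JSON configuration to the Connect REST API. The hostnames, credentials, and table names below are placeholders, and the exact property names vary between Debezium versions.

```python
# Sketch of registering a Debezium MySQL connector via the Kafka Connect REST API.
# Hostnames, credentials, and table names are placeholders.
import requests

connector = {
    "name": "reviews-db-connector",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql.internal.example.com",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "********",
        "database.server.id": "184054",
        "topic.prefix": "reviews-db",        # 'database.server.name' on older Debezium versions
        "table.include.list": "app.reviews",
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-changes.reviews",
    },
}

resp = requests.post(
    "http://kafka-connect.internal.example.com:8083/connectors",
    json=connector,
    timeout=30,
)
resp.raise_for_status()
```

Once the connector is running, every insert, update, and delete on the included tables appears as an event on a Kafka topic, ready to be persisted to the lake.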

To address database updates and deletes, we chose Upsolver’s service to complete our pipeline. It now looks like this:

Deployment Schema
Yotpo’s CDC Pipeline

Got a pipeline — now what?

Appetite comes with eating, and so having its data synced to the data lake became a basic requirement of every new feature. As you can see, there are a lot of moving parts in this pipeline, and adding a new table to the data lake requires updating all of them. So how do we make things easier and less cumbersome? First, start with a well-documented procedure. Then — automate.

As mentioned earlier, we’d like to stick with our current tooling as much as possible. In this case, it means leveraging our automation server (Jenkins is our tool of choice). With the power of its DSL plugin, it’s quite easy to generate simple forms that in turn can run complex pipelines.

Over several iterations we automated away the different steps, starting with the tool furthest from developers’ daily routine (Upsolver) and finishing with the closest (git). As the level of automation grew, so did the velocity of the teams using it.

Like batch processing, this process is not perfect either. We would like a better way to track schema changes, and frankly, our monitoring could be better. But since our main focus is to get out of the developer’s way and make the whole thing less of a yak-shaving exercise, we consider it mission accomplished.

User tracking events

Like every other data-driven company that operates over the Internet, we’re really interested in what Yotpo’s users are doing. Which features are they using? Where are they spending their time? Why didn’t they complete that flow?

We use Segment.IO’s platform to collect data points to answer these questions, and then store them in our data lake for further analysis. It looks something like this:

User Events Diagram
Yotpo’s User Events Pipeline
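
For a sense of the collection side, a server-side event sent through Segment’s analytics-python library looks roughly like the sketch below; the write key, user id, event name, and properties are placeholders, and most front-end events would go through Segment’s JavaScript snippet instead.

```python
# Sketch of sending a user tracking event through Segment's analytics-python library.
# The write key, user id, event name, and properties are placeholders.
import analytics

analytics.write_key = "YOUR_SEGMENT_WRITE_KEY"

analytics.track(
    user_id="user-1234",
    event="Review Request Sent",  # hypothetical event name
    properties={
        "storeId": "store-42",
        "channel": "email",
    },
)

analytics.flush()  # make sure the event is delivered before the script exits
```

Each call like this produces a JSON payload, and it’s the schema of those payloads that the Tracking Plan described below is meant to keep in check.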

One of the features we found appealing as a data team is the ‘Tracking Plan’, which lets you describe and validate the schemas of the collected events. Unfortunately, a UI that allows you to customize everything also leaves a lot of room for error. And introducing another tool to a large user base has a training cost that we’d like to avoid.

Putting on blinders

Luckily, Segment has great API support and they love JSON as much as we do. Therefore, we decided that instead of using Segment’s UI we’d create a CI-based (Continuous Integration) process:

  • We use Segment’s tooling to generate the events’ schemas
  • We store all the schemas in a git repository
  • Whenever a schema is added or changed, our CI runs and validates that the change matches our guidelines (a sketch of such a check follows the diagram below). For example:
    - properties must be camel-cased
    - they must contain a description
    - new types must be nullable
  • We ‘release’ the new schema version by posting it to Segment’s API.

Events pipeline diagram
Adding New User Event Schema
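
A minimal sketch of such a CI check, assuming the event schemas are stored as JSON Schema files under a schemas/ directory, could look like this (the layout and rules are illustrative, not our exact implementation):

```python
# Sketch of a CI check that validates event schema files against naming guidelines.
# Assumes JSON Schema files live under schemas/; layout and rules are illustrative.
import json
import re
import sys
from pathlib import Path

CAMEL_CASE = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)*$")

def validate_schema(path: Path) -> list[str]:
    errors = []
    schema = json.loads(path.read_text())
    for name, spec in schema.get("properties", {}).items():
        if not CAMEL_CASE.match(name):
            errors.append(f"{path}: property '{name}' is not camel-cased")
        if not spec.get("description"):
            errors.append(f"{path}: property '{name}' is missing a description")
        # New properties must be nullable so existing producers don't break.
        types = spec.get("type", [])
        if isinstance(types, str):
            types = [types]
        if "null" not in types:
            errors.append(f"{path}: property '{name}' must be nullable")
    return errors

if __name__ == "__main__":
    all_errors = [e for f in Path("schemas").rglob("*.json") for e in validate_schema(f)]
    for error in all_errors:
        print(error)
    sys.exit(1 if all_errors else 0)
```

If the check passes, a later CI step posts the updated schemas to Segment’s API; if not, the change is blocked with the errors above.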

This way we have more control over the changes being made, which makes them less error-prone. Not to mention that controlling the process through git lets us take advantage of all the tooling already connected to it.

This is a great example of how even a tool built for a data-intensive task sometimes needs to be customized and altered to fit an organization. Things that make sense in a development organization of a few dozen people become a real problem when you scale to hundreds or more.

Wrap up

As you can see from these examples, there’s a lot to do in terms of improving the developer experience when it comes to data. Even the most polished SaaS tools benefit from being made to fit your current stack, rather than the other way around.

While we’re all waiting for the data scene to mature and offer more ubiquitous and integrated ways to work with data, we’ll keep focusing on bridging that gap. We see that when we succeed, it gives a sense of ownership to the individual teams in the organization, removes friction and enables them to deliver value to Yotpo’s customers faster. And isn’t that what it’s all about?
