Introducing Metorikku — Big Data Pipelines using Apache Spark

Published in Yotpo Engineering, Jan 22, 2018

Big Data solutions and platforms have become a major trend in the tech world. They let us better understand our data and get meaningful insights from it. The pain point is how to store, manage and process that data effectively.

Usually the challenges are to compute real meaningful insights, build personalised recommendation systems, increase engagement, apply fraud detection, churn prevention, reporting, text analysis using NLP, or any other processing.

In this blog post, I will be introducing Metorikku, and how we use it to build batch data pipelines, combining many data sources efficiently, saving valuable resources and time.

What is Metorikku?

Metorikku is a distributed ETL engine built on top of Apache Spark SQL. By creating a simple configuration, you can define your input sources, your data manipulation steps and lastly your output sources.

Metorikku integrates with a wide variety of popular data sources such as Cassandra, Redshift, Redis, Segment, JSON, Parquet and so on. Data manipulations are performed by a predefined set of metrics, which run on Spark’s distributed SQL engine.

Whether you need to support A/B testing, train machine learning models, or pipe transformed data into various data stores, Metorikku provides the infrastructure to perform all data preparations. On top of it, you can apply machine learning models and perform graph processing, and the organisation’s various applications can consume the results.

Metorikku came to life after we encountered several similar use cases again and again. From data cleaning and preparation tasks to powering dashboards and email digests in production, Metorikku is used by several teams in Yotpo including Big Data, Data Engineering, Data Science, BI Analysts and Full Stack Developers.

By exposing our different data sources and stores and making them easily approachable and query-able, we have definitely made an impact by making our company more “data-driven” than ever before.

How Does it Work?

All you need to get started is a running Spark cluster. Metorikku requires Apache Spark v2.2 and above.

Metorikku has easy-to-use instructions and flexible configurations, so you can easily create data processes.

We like to schedule Metorikku jobs using Apache Airflow, originally created at Airbnb.

Running Metorikku requires defining your inputs, outputs, and metrics. Metorikku loads the data, initialises a Spark Session and registers the input tables.

A metric is defined by its SQL steps in the metric configuration file. Each step defines a Spark DataFrame which you can select from in later steps. After the DataFrames are computed, Metorikku's output writers handle the writing process.
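The way steps build on each other can be sketched in plain Python. This is only an illustration under loud assumptions: the standard-library sqlite3 module stands in for Spark SQL, temporary views stand in for DataFrames, and the names run_metric, movies, ratings, avg_ratings and top_movies are hypothetical and not part of Metorikku.

```python
import sqlite3


def run_metric(conn, steps):
    """Run SQL steps in order, exposing each result as a named view.

    `steps` is a list of (name, sql) pairs. Each step becomes a temporary
    view, so later steps can select from earlier ones by name.
    """
    for name, sql in steps:
        conn.execute(f"CREATE TEMP VIEW {name} AS {sql}")
    # Return the rows of the last step as the metric's result.
    last_name = steps[-1][0]
    return conn.execute(f"SELECT * FROM {last_name}").fetchall()


# Stand-in for loading the data and registering the input tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (movie_id INTEGER, title TEXT)")
conn.execute("CREATE TABLE ratings (movie_id INTEGER, rating REAL)")
conn.executemany("INSERT INTO movies VALUES (?, ?)", [(1, "Alien"), (2, "Heat")])
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?)", [(1, 5.0), (1, 4.0), (2, 3.0)]
)

# Hypothetical metric: the second step selects from the first one.
steps = [
    ("avg_ratings",
     "SELECT movie_id, AVG(rating) AS avg_rating "
     "FROM ratings GROUP BY movie_id"),
    ("top_movies",
     "SELECT m.title, a.avg_rating FROM avg_ratings a "
     "JOIN movies m ON m.movie_id = a.movie_id "
     "WHERE a.avg_rating >= 4 ORDER BY m.title"),
]

result = run_metric(conn, steps)
print(result)
```

In this sketch the last step's rows are simply returned to the caller; in a real job that is the point where an output writer would take over.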

Basic Configuration Examples

To run Metorikku you must first define two files.

The first is a YAML configuration file, which includes your input sources, output destinations and the locations of your metric files. For further explanation of the different configurations Metorikku supports, please go to the project repository on GitHub.

A simple movies.yaml file for Metorikku could be as follows:

The second is a metric file, which defines the steps and queries of the ETL as well as where and what to output.

For example, a simple configuration JSON could be as follows:

Notice that once we registered our inputs using the YAML configuration, we can now use them inside our queries as a given data source.

Metorikku also provides a built-in testing framework named “Metorikku Tester”, which helps you write tests for your data manipulations by defining mock data and the desired outputs.

A simple mocks file is defined using the JSONL format.
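JSONL (JSON Lines) means one JSON object per line, so each line can describe one row of mock input. A minimal sketch of reading such data in Python follows; the field names and the helper parse_jsonl are hypothetical, made up for illustration, and are not taken from Metorikku.

```python
import json


def parse_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical mock rows for a "ratings" input, one JSON object per line.
mock_ratings = """\
{"movie_id": 1, "rating": 5.0}
{"movie_id": 1, "rating": 4.0}
{"movie_id": 2, "rating": 3.0}
"""

rows = parse_jsonl(mock_ratings)
print(len(rows), rows[0])
```

Because every line is an independent JSON document, mock rows can be appended or removed without touching the rest of the file.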

Lastly, we feed a JSON test file to Metorikku Tester in order to run our tests.
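Conceptually, a test of this kind runs the metric against the mock inputs and compares the resulting rows with the desired ones. The sketch below shows only that comparison idea in plain Python. The function check_expected and the sample rows are hypothetical, and nothing here reflects Metorikku Tester's actual file format or implementation.

```python
from collections import Counter


def check_expected(actual_rows, expected_rows):
    """Compare two lists of row dicts, ignoring row order.

    Returns a (missing, unexpected) pair: rows that were expected but not
    produced, and rows that were produced but not expected. Duplicate rows
    are counted, so producing a row twice when it is expected once is
    reported as unexpected.
    """
    def key(row):
        # Dicts are unhashable, so use a sorted tuple of items as the key.
        return tuple(sorted(row.items()))

    actual = Counter(key(r) for r in actual_rows)
    expected = Counter(key(r) for r in expected_rows)
    missing = list((expected - actual).elements())
    unexpected = list((actual - expected).elements())
    return missing, unexpected


# Hypothetical output of a metric and the rows we wanted to see.
actual = [{"title": "Alien", "avg_rating": 4.5}]
expected = [{"title": "Alien", "avg_rating": 4.5}]

missing, unexpected = check_expected(actual, expected)
passed = not missing and not unexpected
print("passed" if passed else "failed")

# A deliberately wrong expectation is reported as a mismatch.
bad_missing, bad_unexpected = check_expected(
    actual, [{"title": "Heat", "avg_rating": 3.0}]
)
```

Comparing rows as a multiset rather than as an ordered list is a design choice that suits SQL results, where row order is undefined unless the query sorts explicitly.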

Looking Ahead

It’s impressive how much big data infrastructures and tools have improved over the past years. We’ve come a long way, from Hadoop clusters and complex MapReduce implementations to easy-to-use frameworks, data infrastructures and APIs.

With few exceptions, you shouldn’t build infrastructures or tools from scratch these days, and by not doing so you can save a substantial amount of your developers’ capacity. I expect that big data infrastructures and tools will continue to grow fast, at least as fast as the data itself!

The new Metorikku tool offers a simple platform to combine your data sources, so you can analyse, test, and write the results to a data source of your choice.

Metorikku will continue to evolve. Planned features include support for Kafka or other streaming platforms to enable the creation of Lambda applications, more types of writers and readers such as JDBC, a web interface for easy creation of metrics and configurations, and much more.

You are welcome to check out our contributing guide, or leave a comment with any further questions.

Written by Ofir Ventura