Part 1 — Building a Data Pipeline for e-commerce platform environments (e.g. Marketplaces)

Maher Deeb
KI group
Nov 15, 2021 · 13 min read

This article was co-written by Maher Deeb, Yoann Dupe, Riccardo Bove and Swapnil Udgirkar from KI performance.

Tracking the performance of an e-commerce platform requires a comprehensive strategy for dealing with big data challenges. Those challenges include identifying the data sources, collecting the data, and making it usable and insightful along the way, with a profitable, data-driven business as the target.

At KI group, specifically in KI performance, we have been dealing with e-commerce platforms in different industries for a while. The data pipeline that we built recently for one of those e-commerce platforms is one of the latest outstanding achievements of the data team at KI performance. We successfully use the data lake as the single source of truth for all collected data. That enables us to produce 360-degree reports at the company level without losing the luxury of exploring the data at a granular level. Moreover, such a pipeline gives every team the freedom to use its own reporting tools for local reporting.

The recent deeplearning.ai specialization about Machine Learning (ML) in production by Andrew Ng underlines the importance of data pipelines like the one we built. In one of the videos, Andrew mentions an interesting paper (linked here) that shows how much engineering work is needed to bring ML code into production.

This series consists of two articles: the first walks you through the best practices we followed to build a scalable data pipeline like the ones we made for those e-commerce platforms and lists the Azure services that serve such a pipeline best. The second part introduces some of the challenges and lessons learned that we keep in mind for the future.

Data Pipeline Components

Data pipelines come in different forms and shapes, each designed for a specific purpose and project. However, they share many similarities. The two main components of a data pipeline are the data source and the data sink. Between the source and the sink, many steps can be performed on the data. The source defines where the data is currently available (the data producer), and the sink is where the data should end up (for the data consumer). Those are the beginning and the end of the pipeline. The most important step that comes right after ingesting the data is cleaning. At this stage, the data is processed to identify, rectify, or consolidate corrupt or inaccurate records.

Data sources

In the e-commerce landscape, data comes from a variety of sources, including the front-end, the back-end, and all specialized tools that every team uses.

Engineering team

The back-end team works on developing or integrating the e-commerce core engine into the platform system. All transactions, payments, and customer data come from there directly into the data lake. The front-end team works on developing the storefront. Data related to product state and customer activities comes directly or indirectly (through other integrated services, e.g., Google Analytics) into the data lake.

Product team

Data related to products, prices, and stocks should be collected more than once a day.

Operations team

Data related to transactions and fulfilment status comes directly from the operations team's system into the data lake.

Marketing team

Data from Google Analytics, Klaviyo, Segment, etc., comes in through the integrated API endpoints.

Customer services team

Tickets raised by customers via phone and email, together with their statuses, are collected hourly.

Human resources

Data related to the current open positions, the number of candidates who applied, interview results, etc., is collected daily. A retention policy deletes the data after a certain date or period defined by law.

Data sink

On top of that, building a data lake is essential for bringing all the data together and getting the most out of it. The idea is to be able to blend all the different sources and perform in-depth analysis that combines data from all of them.

Ingestion layer

It’s often the case that a project consumes a multitude of different data sources, each coming from different projects with their own dependencies, life cycles, and unexpected downtimes. For this, we designed and built an ingestion layer with two powerful features.

  1. It can scale up and down to fit the demand of the data consumed.
  2. In case of downtimes, it self-heals by re-consuming data over a time range relative to the downtime (a minimal sketch follows below).
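
To illustrate the self-healing behaviour, here is a minimal sketch (not our production code; the function name and the default look-back are made up) of how an ingestion job can derive its time window from the timestamp of the last successful run, so that any downtime is backfilled automatically on the next run:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

def ingestion_window(last_success: Optional[datetime],
                     default_lookback: timedelta = timedelta(hours=1)) -> Tuple[datetime, datetime]:
    """Return the [start, end) time range the next ingestion run should cover."""
    now = datetime.now(timezone.utc)
    # Resume from the last successful run; if there is none, fall back to a default window.
    start = last_success if last_success is not None else now - default_lookback
    return start, now

# Example: the last successful run finished 6 hours ago (e.g. after a downtime),
# so the next run re-consumes the whole 6-hour gap instead of only the last hour.
start, end = ingestion_window(datetime.now(timezone.utc) - timedelta(hours=6))
```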

Storage layer

We have built a storage layer that consists of two main parts:

Data lake

In order to preserve the data in a centralized location, we make use of the data lake to store data in a cheap and unstructured format. We split the data lake into several zones: the raw, the cleansed, the curated, and the exploration zones. We store data of different quality for different purposes in those zones. We use a defined file structure to ensure that we store the data in a systematic way. That helps us find the data quickly and identify when the data was dumped into the data lake and where it came from.
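
As an illustration only (the exact convention is project-specific and the names below are made up), such a zone/source/date-partitioned layout could be generated like this:

```python
from datetime import datetime, timezone

def lake_path(zone: str, source: str, dataset: str, ts: datetime) -> str:
    """Build a path of the form zone/source/dataset/year=YYYY/month=MM/day=DD/file."""
    return (
        f"{zone}/{source}/{dataset}/"
        f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/"
        f"{dataset}_{ts.strftime('%Y%m%dT%H%M%S')}.json"
    )

# e.g. raw/google_analytics/sessions/year=2021/month=11/day=15/sessions_20211115T093000.json
print(lake_path("raw", "google_analytics", "sessions", datetime.now(timezone.utc)))
```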

Data warehouse

Through a data warehouse, we make the data available for everyone who would like to use it. The main purpose is to perform analytics queries to answer business questions. Keeping that in mind, we store the data using the star schema. In that schema, we store the data that should be analyzed or aggregated in large central tables called fact tables. Detailed information about the main columns of the fact tables is stored in the dimension tables. When storing data in the data warehouse, it is important to think about how to partition the data. Bad partitioning can lead to slow queries and high costs. Auto-scaling is one of the most important features that should be considered when dealing with big data in data warehouses.
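
As a rough illustration of how the star schema is queried (the table and column names below are invented for the example), an analytics query typically joins the large fact table to a small dimension table and aggregates:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical star-schema tables: a large fact table of order lines
# and a small product dimension table.
fact_order_lines = spark.table("dwh.fact_order_lines")
dim_product = spark.table("dwh.dim_product")

# Join the fact table to the dimension table on the surrogate key and aggregate.
revenue_per_category = (
    fact_order_lines
    .join(dim_product, "product_key")
    .groupBy("category")
    .agg(F.sum("net_revenue").alias("revenue"))
)
revenue_per_category.show()
```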

Reporting layer

The goal of the reporting layer is to provide the data in good shape to the management teams. The report can come directly from the data warehouse or through a Business Intelligence (BI) tool. A report can contain static and interactive tables, charts, and KPIs that help the teams to trace their numbers and evaluate the performance of the business. Cross-team reporting can be very powerful to understand the contribution of every team to the business.

Processing layer

The processing layer reshapes the data according to the business requirements. That includes cleaning the data and standardizing its values and schema. We built the processing layer essentially between the data lake and the data warehouse. The ETL jobs in the processing layer extract the data from the data lake, transform it into the required shape, and load it into the data warehouse.
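
The sketch below shows roughly what such a job can look like in PySpark (the paths, column names, and JDBC connection are placeholders, not our actual configuration): extract from the raw zone, clean and standardize, and load the result into the warehouse.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Extract: read the raw JSON that the ingestion layer dumped into the data lake.
raw_orders = spark.read.json("/mnt/datalake/raw/shop/orders/")

# Transform: deduplicate, fix types, and standardize the schema.
clean_orders = (
    raw_orders
    .dropDuplicates(["order_id"])
    .withColumn("order_ts", F.to_timestamp("order_ts"))
    .withColumn("net_revenue", F.col("gross_revenue") - F.col("tax"))
)

# Load: append the cleaned data to a staging table in the data warehouse via JDBC.
(clean_orders.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net;database=dwh")
    .option("dbtable", "staging.orders")
    .mode("append")
    .save())
```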

Azure Managed Services for Data Pipelines

When it comes to building a scalable, extendable, and maintainable data pipeline, a good architecture and the skills required to maintain that pipeline are crucial factors for long-term success. In a start-up environment where things move fast, acquiring skills is much slower than the pace at which the business grows. Therefore, using managed services is a good alternative to manage uncertainty and keep flexibility. Managed services remove the maintenance overhead and increase productivity.

When working with an e-commerce business, we know that the data we ingest, process, and load is going to grow over time. Therefore, we must be able to scale the pipeline and the database accordingly without compromising the reports and analysis systems at the end of the chain.

Choosing a cloud provider depends on many factors, such as availability in the customer’s region, the preferences and experience of the customer’s tech team, prices, live support, legal aspects, etc. For our platform, we use Azure to build the data pipeline and deploy it in the customer’s region.

Azure Functions

Azure Functions is a great tool for those searching for a trade-off between flexibility and convenience. On one side, Azure provides the infrastructure, and you don’t need to care about the resources; it’s serverless and on-demand. On the other side, Azure Functions supports many languages, such as Python, C#, and JavaScript, to name a few. Another benefit you can leverage is the pay-per-consumption pricing, which can be very cheap, while other hosting options (Premium and App Service plans) allow you to fit your needs.

We have been using Azure Functions as part of the ingestion layer stack. In this marketplace’s case, we run Azure Functions to collect data from more than 100 endpoints. Since every endpoint returns data in a different schema, Azure Functions helps us extract that data and test its quality before dumping it into the data lake. Azure Functions integrates well with CI/CD pipelines and TDD best practices, which helped us maintain quality. More about Azure Functions here: Azure Functions
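
As a rough sketch of what one of these ingestion functions can look like (the endpoint, container, and connection string are placeholders, the schedule lives in function.json, and the real functions run more thorough quality checks), a timer-triggered Python function pulls an API and dumps the raw response into the raw zone:

```python
import json
import logging
from datetime import datetime, timezone

import azure.functions as func
import requests
from azure.storage.filedatalake import DataLakeServiceClient

def main(timer: func.TimerRequest) -> None:
    # Pull one of the (hypothetical) marketplace endpoints.
    response = requests.get("https://api.example-marketplace.com/v1/orders", timeout=30)
    response.raise_for_status()
    payload = response.json()

    # Minimal quality gate before anything lands in the lake.
    if not payload:
        logging.warning("Empty payload, nothing ingested.")
        return

    # Dump the raw response into the raw zone of the data lake (Gen2).
    service = DataLakeServiceClient.from_connection_string("<storage-connection-string>")
    filesystem = service.get_file_system_client("raw")
    path = datetime.now(timezone.utc).strftime("shop/orders/%Y/%m/%d/orders_%H%M%S.json")
    filesystem.get_file_client(path).upload_data(json.dumps(payload), overwrite=True)
```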

Data Factory

If you’re an experienced data engineer, you’ve already written ETL jobs in your favourite language, but what about abstracting the code and creating your pipeline using a UI?

If you are searching for convenience, that’s what Data Factory is about: an Azure cloud ETL service for data pipelines with a code-free UI. There is more to it: it’s a fully managed service that can easily scale, transform your data (using Data Flow), and connect to many sources and sinks, from databases to APIs, to AWS, and even Salesforce (see the long list of connectors here: Data Factory connectors). If you’re worried about having very specific needs, you can inject Databricks notebooks and Azure Functions into the pipeline: the best of both worlds!

In this specific case, we use Data Factory to collect data from a legacy data warehouse and dump it directly into the data lake. Integrating Data Factory pipelines into a CI/CD environment can be a challenge if reproducibility is an important factor for your pipeline. More about Azure Data Factory here: Data Factory.

Event Hubs

In our e-commerce project, we ingest data from a variety of sources that needs to be routed to independent systems for further processing. To make this ingestion reliable for many events in a timely manner, we use Event Hubs.

Azure Event Hubs is a fully managed streaming platform and event ingestion service. It supports real-time and batch processing, allowing you to process streams of data and store them in Blob Storage or the data lake. It also integrates easily with Azure Functions for a serverless experience. More about Azure Event Hubs here: Azure Event Hubs.
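
For illustration, producing events to an Event Hub from Python with the official SDK looks roughly like this (the connection string and hub name are placeholders):

```python
import json
from azure.eventhub import EventHubProducerClient, EventData

# Placeholder connection details for the Event Hubs namespace.
producer = EventHubProducerClient.from_connection_string(
    conn_str="<event-hubs-namespace-connection-string>",
    eventhub_name="storefront-events",
)

# Batch a few storefront events and send them in a single call.
with producer:
    batch = producer.create_batch()
    batch.add(EventData(json.dumps({"event": "page_view", "sku": "ABC-123"})))
    batch.add(EventData(json.dumps({"event": "add_to_cart", "sku": "ABC-123"})))
    producer.send_batch(batch)
```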

Azure Data Lake Storage Gen2

By definition, a data lake is a repository of data stored in its raw format, whether structured or unstructured. It allows the incoming data volume to grow continually, up to multiple petabytes!

The data lake is essentially the central hub for all your data before it is used for a specific project such as analytics or ML. It basically allows you to manage a massive amount of data easily. Azure Data Lake Storage Gen2 is built on top of Azure Blob Storage, using a hierarchical namespace that organizes objects with slashes to mimic a directory structure. Moreover, if you’re used to Hadoop, Data Lake Storage Gen2 is Hadoop-compatible via the ABFS driver in all Apache Hadoop environments (Azure HDInsight, Azure Databricks, and Azure Synapse Analytics). Finally, it is very cost-effective! More about Azure Data Lake Storage Gen2 here: Azure Data Lake G2.
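
Because of that Hadoop compatibility, Spark can address the lake directly through an abfss:// URI; a minimal sketch (the storage account, container, and path are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The ABFS driver lets Spark read the hierarchical namespace directly.
# URI format: abfss://<container>@<storage-account>.dfs.core.windows.net/<path>
orders = spark.read.parquet(
    "abfss://cleansed@<storage-account>.dfs.core.windows.net/shop/orders/"
)
orders.printSchema()
```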

Databricks

Databricks is actually two products in one, both with data analytics in mind. With Azure Databricks SQL Analytics, analysts can run SQL queries on their data lake. That makes it easy to query big data in a more traditional way using the well-known SQL language.

The second product is the Databricks workspace, based on Apache Spark. It is more suitable for a team of data scientists, engineers, and analysts who want to collaborate. From building ETL jobs (with Data Factory), streaming data (with Apache Kafka), or reading data from the data lake, to analyzing data with GraphX, Databricks is extremely versatile. You can even create collaborative notebooks and interactive dashboards with your team for a streamlined data exploration process.

For our project, we built all our ETL jobs in Databricks using PySpark. One of the great features that we use to maintain the quality of those ETL jobs is the Databricks CLI. It helps us integrate all of our ETL jobs into a CI/CD pipeline and automate our systems completely.

More about Azure Databricks here: Azure Databricks

Azure Synapse

As part of the storage layer, Azure Synapse functions as a powerful distributed system that scales both storage and processing units to satisfy demand. On the reporting side, when Synapse Analytics was introduced in late 2019, Power BI was already a well-established Microsoft product. A crucial consideration for greater cohesion between Power BI and the Azure cloud services was therefore tight integration between Azure Synapse and Power BI. In Synapse, this is achieved through linked services, which provide a direct connection to Power BI workspaces. Users can first execute the ETL process and create reporting views within Synapse. Then, with the help of the Power BI workspace connection, these views can be provisioned directly as Power BI datasets for creating reports and dashboards on top of them.

Azure analysis services

Keeping heavy transformations and modeling outside the reporting tool is where Azure cloud services once again come into the picture. With more and more companies migrating their data and related infrastructure to the cloud, it became necessary to also provide them with robust modeling and reporting services. Azure Analysis Services is one such service: it provides enterprise-grade semantic data models for business reports and client applications such as Power BI.

Power BI as complementary tool for Azure services

Though Power BI is identified mostly as a visualization tool, that is just one of its components. Power BI, in its entirety, is a collection of software services (SaaS), apps, and connectors that aim to unify disparate data sources and visualize the data for actionable insights. While the Power BI service allows companies to have workspaces for collaborating on and sharing these insights through datasets and reports, the Power BI Desktop application provides users with powerful tools such as DAX to transform data and a range of customizable visualizations for building reports. It should be kept in mind, however, that although Power BI Desktop allows data transformations, it is not intended as an ETL tool. To transform large volumes of data and execute complex transformations, tools dedicated specifically to ETL are always a better choice.

Data Catalog

Data Catalog makes it easy to discover and understand data by registering any data source in the catalog. The data is not duplicated; instead, its metadata is added to the catalog, and any user can add annotations, tags, or documentation to enrich it. That way, it is easy to find the right data across the entire organization, especially when the data is managed by different teams with different knowledge. Data Catalog helps in many cases, such as knowing whether specific data exists, finding where the data is located, understanding data you’re not familiar with, knowing whom to ask for access to the data, or managing a self-service BI system that combines data from multiple sources. For our project, every piece of data that we store in the data lake or the data warehouse is registered in the data catalog.

More about Azure Data Catalog here: Azure Data Catalog

Azure Active Directory

Of course, it is very important to keep your data secure and restrict its access to the right people. You don’t want everyone to have access to sensitive data or critical financial KPIs, but you do want access to be easy to administer across many resources and applications. We use Azure Active Directory (AAD) for our clients, a cloud-based identity and access management service. The advantage of Active Directory is that you can administer external resources such as Microsoft 365, the Azure portal, and Salesforce, as well as internal resources on your intranet. It comes with single sign-on and multi-factor authentication and lets you create groups to manage access. You can create direct, group-based, or rule-based assignments for a structured access system. In our e-commerce platform project, we used AAD to create different data user groups: data engineers, data scientists, data analysts, and business people. Each user group had access to a certain part of the pipeline. For example, data engineers have access to all zones of the data lake. Data scientists have access to all of the data lake’s zones except the raw zone. Analysts have access to the curated zone. Business people have access to Azure Analysis Services and Power BI.

More about AAD here: Azure Active Directory

Application Insights

As part of the Azure Monitor service, Application Insights monitors your live applications and detects performance anomalies to help you improve performance and usability. It works on many platforms and for on-premises or public cloud applications. It also helps you understand how the application is being used. Application Insights monitors many interesting metrics, such as page popularity, response time, external service speed, session count, server performance, and many more. Not only can you collect all this data, but you can also explore it to get valuable insights! The Azure documentation lists several ways of getting the most out of it via dashboards, alerts, Power BI, or the REST API.

In our project, we use Application Insights mainly to monitor the performance of the jobs in the ingestion and ETL layers and to debug any exception we observe. We connect Application Insights to Slack for near-real-time monitoring.

More about Azure Application insights here: Azure Application Insights
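
As a sketch of one possible way to push logs and exceptions from a Python job to Application Insights (this uses the OpenCensus Azure exporter as an assumed client library; the connection string and the run_etl entry point are placeholders):

```python
import logging
from opencensus.ext.azure.log_exporter import AzureLogHandler

logger = logging.getLogger("etl_job")
# Placeholder connection string taken from the Application Insights resource.
logger.addHandler(AzureLogHandler(
    connection_string="InstrumentationKey=<key>;IngestionEndpoint=<endpoint>"
))

def run_etl() -> None:
    # Placeholder for the actual ETL logic.
    ...

try:
    run_etl()
except Exception:
    # The exception and its stack trace show up in Application Insights,
    # where an alert rule can forward it to Slack.
    logger.exception("ETL job failed")
    raise
```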

Figure: data flow building blocks and connections.

That’s it for the first part of this series on building a data pipeline for an e-commerce platform! Keep an eye out for the next article about the challenges and lessons we learned while building this product!

At KI group we are looking for entrepreneurs, solvers and creators who want to make a difference by building sustainable, user-, customer- and planet-driven business models & solutions in a constantly evolving world. If you’re interested in working in a fast-paced, diverse environment on a variety of projects, companies, products and technologies, be sure to get in touch with us; we are looking forward to meeting you!


Maher Deeb
KI group

Senior Data Engineer/Chapter Lead Data Engineering @ KI performance