Part 2 — Building a Data Pipeline for e-commerce platform environments (e.g. Marketplaces)

Maher Deeb
KI group
Nov 22, 2021

This article was co-written by Maher Deeb, Yoann Dupe, Riccardo Bove and Swapnil Udgirkar from KI performance.

Welcome back! In this series of articles, we have been sharing our insights on building the data pipeline for an e-commerce platform. This second article focuses on the challenges we faced and the lessons we learned that are worth keeping in mind for future projects. If you haven’t checked out the first article, follow this link.

Challenges and lessons learned

As you might imagine, a project of this scope comes with different kinds of challenges, so let’s go through them!

Business and management challenges:

Uncertainty management:

At the beginning of a project of this scope, many decisions must be made regarding the infrastructure, the minimum scale of the services, and the data we should collect. At that stage, the data team should keep in touch with all other teams to understand their demands, their sources of data, and the scale of that data. Changing requirements are an everyday reality that the data team, like every other team, has to deal with.

Managing uncertainty requires making assumptions. Those assumptions should be well documented and arranged so that they can be tested from day one. Our data team did a great job on that! We conducted research to estimate the traffic the e-commerce platform could expect during the following months and calculated roughly how much data each user would generate. We focused on the flexibility of the infrastructure to satisfy the demand. Auto-scaling features in many Azure managed services helped us balance the trade-off between the cost of the services and the uncertainty.

Enforcing data-based culture from day one:

The explosive amount of data in the industry compels us to enforce a data-driven culture, and this should be done from day one, since reverting later can be costly. Convincing and training the company’s teams to adopt and follow the data-based culture was one of the biggest challenges we had to face as a data team. But it was worth it! As soon as the product went live, the data was available in Power BI from the beginning. Although it took many iterations to create a dashboard that satisfied the needs of the management team, it was very easy to accommodate the new requirements.

Here are some of the approaches we used to enforce data-based culture best practices:

  • Training all teams, from day one and before launch, to adopt and use the data catalog.
  • Helping the teams develop business questions that can be answered with data.
  • Exposing the corresponding data sets to the corresponding teams: we have the infrastructure in place to expose the data to the interested parties, including data engineers, data scientists, and business people.
  • A streamlined process for granting data access: don’t waste time with bureaucracy. As projects get larger and more components are added, each with its corresponding layers of security, chasing people around for access can cost a lot of time. Establishing a streamlined process with a fully dedicated DevOps team can be a huge time saver!
  • Exposing your data to validation by the corresponding parties: when working with large datasets, it is often uncertain whether all the calculated data is correct. To mitigate this, we expose our data to regular checks by the corresponding parties.

Technical challenges

Unit and integration testing

One of the most frequent challenges in data pipeline projects is finding an appropriate way to implement unit and integration testing. Frequently there is no obvious way to write a test, since the data is often contained within a DataFrame.

In this scenario, we came up with a strategy to run both the unit and integration tests under Python’s unittest package and assert the results with pandas.

Firstly, in order to write tests that assert that our ETLs are running as they should, we have to collect the data; there is no way around that. To do this, we take advantage of the Spark DataFrame’s built-in toPandas() method, which converts the Spark DataFrame into a pandas DataFrame, so we can approach the assertions in a more conventional way.
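
As an illustration, a unit test following this pattern might look roughly like the sketch below. The transformation under test (add_total_price) and the column names are invented for the example; the pattern is simply: build a small input DataFrame, run the ETL step, call toPandas(), and assert on the result.

    import unittest

    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession


    def add_total_price(df):
        # Hypothetical ETL step under test: total = quantity * unit_price
        return df.withColumn("total", F.col("quantity") * F.col("unit_price"))


    class AddTotalPriceTest(unittest.TestCase):
        @classmethod
        def setUpClass(cls):
            cls.spark = SparkSession.builder.master("local[1]").appName("unit-tests").getOrCreate()

        @classmethod
        def tearDownClass(cls):
            cls.spark.stop()

        def test_total_is_quantity_times_unit_price(self):
            input_df = self.spark.createDataFrame(
                [("sku-1", 2, 9.99), ("sku-2", 1, 5.00)],
                ["sku", "quantity", "unit_price"],
            )

            # Collect the Spark DataFrame into pandas so we can assert conventionally
            result = add_total_price(input_df).toPandas()

            self.assertListEqual(result["total"].round(2).tolist(), [19.98, 5.00])


    if __name__ == "__main__":
        unittest.main()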

Keep in mind that DataFrames can sometimes contain a large number of columns, and checking every single one with an assert statement can be cumbersome. In this scenario, we use assert_frame_equal from pandas._testing to assert the whole DataFrame in one go.
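
For whole-DataFrame assertions, the comparison can look roughly like this. The frames are hard-coded here purely for illustration; in a real test, result would come from toPandas(), and note that assert_frame_equal is also exposed publicly under pandas.testing:

    import pandas as pd
    from pandas.testing import assert_frame_equal  # public home of assert_frame_equal

    # In a real test, 'result' would be the output of spark_df.toPandas()
    result = pd.DataFrame({"sku": ["sku-1", "sku-2"], "total": [19.98, 5.00]})
    expected = pd.DataFrame({"sku": ["sku-1", "sku-2"], "total": [19.98, 5.00]})

    # One call compares every column, dtype, and value at once;
    # check_like=True ignores the order of columns and index
    assert_frame_equal(result, expected, check_like=True)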

This is all fine and dandy as far as unit tests are concerned, but when it comes to integration tests a few more variables come into the mix:

  1. Ideally, the integration tests need to run in the same environment where the code is deployed and use the same components.
  2. They need to be automated in the CI/CD pipeline.

To solve these two problems, we made use of the Databricks tool dbx (DataBricks CLI eXtensions, a CLI tool for advanced Databricks jobs management). Why would we need another CLI tool besides the Databricks CLI? Well, it makes the deployment process much easier (and it uses the Databricks CLI under the hood).

That said, with this tool in hand, we were able to run our test suites in a notebook in its corresponding environment, all through the CI/CD pipeline, using two simple commands:

  • dbx deploy
  • dbx launch

And finally, to run the tests in a single notebook, we made use of unittest’s TestLoader and TestSuite, which allowed us to load multiple test cases into a single TestSuite. The suite then runs the individual test cases in the order in which they were added, aggregating the results.
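
A minimal sketch of such a runner notebook cell could look like the following; the test-case classes and module names are placeholders standing in for our actual test modules:

    import unittest

    # Placeholder imports: in practice, these are the project's real test-case classes
    from tests.test_orders_etl import OrdersEtlTest
    from tests.test_customers_etl import CustomersEtlTest

    loader = unittest.TestLoader()
    suite = unittest.TestSuite()

    # The suite runs the cases in the order in which they are added
    suite.addTests(loader.loadTestsFromTestCase(OrdersEtlTest))
    suite.addTests(loader.loadTestsFromTestCase(CustomersEtlTest))

    # The runner executes the suite and aggregates the results of all test cases
    runner = unittest.TextTestRunner(verbosity=2)
    result = runner.run(suite)

    # Fail the notebook (and therefore the CI/CD job) if any test failed
    assert result.wasSuccessful(), "Some tests failed"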

These were some of the challenges, their solutions, and the lessons we learned along the way while building this e-commerce platform project!

At KI group we are looking for entrepreneurs, solvers and creators who want to make a difference by building sustainable, user-, customer- and planet-driven business models & solutions in a constantly evolving world. If you’re interested in working in a fast-paced diverse environment on a variety of projects, companies, products and technologies be sure to get in touch with us — we are looking forward to meeting you!


Maher Deeb
KI group

Senior Data Engineer/Chapter Lead Data Engineering @ KI performance