Do you control and own the code you write?

Code control and ownership in Data Science/Engineering.

Maher Deeb
KI group
8 min read · Dec 7, 2021

Clean code best practices in software engineering are essential for responding to customer feedback and changing requirements while keeping a project on track. For data projects, delivering clean code plays an even more critical role in creating value than in other projects: the high level of uncertainty and the high frequency of requirement changes increase the risk that a data project fails if the engineers cannot react and deliver quickly in such an environment.
Based on my experience, data scientists/engineers can solve many hard problems successfully. However, when it comes to delivery, most of them fail dramatically. In most cases, the value of their work is lost because of poor delivery.
At KI performance, we pay close attention to delivery. We are fully aware of the uncertainty of data projects and of our customers' need to change their requirements in the short and long term. Adopting clean code best practices enables our team to control and own the delivery and to meet the customer's expectations.
In this article, I present a scenario that shows the importance of writing clean code for maintaining the value of a data science/engineering solution on both the customer and the data team sides.

Real-World Scenario

The customer asked an external data scientist/engineer to process the data stored in two CSV files and store the processed data in new CSV files. The customer defined two pieces of processing logic:

For the first set of data, “data1.csv”:

  1. load the data
  2. assign the correct column names
  3. fill in the missing values
  4. drop invalid data
  5. sort the data
  6. group the data by age
  7. store the processed and grouped data

For the second set of data, “data2.csv”:

  1. load the data
  2. assign the correct column names
  3. group the data by company
  4. store the grouped data

Of course, you can extend that logic above further to train a model on top of the data, create a dashboard, or write a report for the managers.

Let us assume that the data scientist/engineer used Python together with the Pandas framework to develop the solution. During that time, the customer built an internal data team, which developed a new framework to replace Pandas. The new framework was highly confidential, and they did not want to share it with the external data scientist/engineer. Since the customer's data team decided not to use the Pandas framework anymore, the customer asked the data scientist/engineer to remove every line of code related to Pandas and send the rest of the code back.

I am going to present the solution written in four different ways. For every case, I am going to raise the following questions:

  1. What is the value that the code produces?
  2. Does the data scientist/engineer control the code?

Case 0

Before removing Pandas code:
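
The embedded snippet from the original post is not reproduced here, but a minimal sketch of such a straight-line script, consistent with the steps listed above, might look like this (file paths, column names, and cleaning rules are illustrative assumptions, not the original code):

# logic1.py: hedged sketch of the "before" state; the whole solution is
# a sequence of direct Pandas calls. Column names and cleaning rules
# below are assumptions for illustration.
import pandas as pd

df = pd.read_csv("data/data1.csv", header=None)           # 1. load the data
df.columns = ["name", "age", "company"]                   # 2. assign column names
df["age"] = df["age"].fillna(df["age"].median())          # 3. fill missing values
df = df[df["age"] >= 0]                                   # 4. drop invalid data
df = df.sort_values(by="age")                             # 5. sort the data
grouped = df.groupby("age").size().reset_index(name="count")  # 6. group by age
grouped.to_csv("data/data1_grouped.csv", index=False)     # 7. store the result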

After removing Pandas code:
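
After the Pandas lines are stripped, essentially nothing remains:

# logic1.py after removing every line that touches Pandas: the file is
# effectively empty. The steps of the solution existed only as Pandas
# calls, so the logic disappears together with the framework.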

Code structure:

src/
├─ logic2.py
├─ logic1.py
├─ data/
│ ├─ data1.csv
│ ├─ data2.csv

You may observe this code style in most data science and data engineering projects. The code is straightforward. However, such a concrete implementation of the solution has a fatal impact on the project in the long run. As you can see, after removing the part of the code related to Pandas, the data team must implement the logic entirely from scratch.

What is the value that the code produces?

I think we can agree that this code produces no value for the customer.

Does the data scientist/engineer control and own the code?

Clearly, the data scientist doesn't own the code; Pandas fully owns it.

So what is happening here? The engineer encapsulated the logic of the solution and embedded it in Pandas code. The solution depends entirely on methods from the Pandas framework, and the logic cannot survive on its own.

How can the data scientist/engineer improve the code and the delivery to maintain the solution’s value after removing the Pandas framework from the code?

Case 1

Before removing Pandas code:
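
Again, the original snippet is not shown; a hedged sketch of the wrapped version, with illustrative method and column names, might look like this:

# logic1.py: hedged sketch; the Pandas calls are wrapped in a class,
# one method per step of the customer's logic.
import pandas as pd

class Logic1:
    def load_data(self, path):
        self.df = pd.read_csv(path, header=None)

    def assign_column_names(self, names):
        self.df.columns = names

    def fill_missing_values(self):
        self.df = self.df.fillna(self.df.median(numeric_only=True))

    def drop_invalid_data(self):
        self.df = self.df.dropna()

    def sort_data(self, by):
        self.df = self.df.sort_values(by=by)

    def group_data(self, by):
        self.df = self.df.groupby(by).size().reset_index(name="count")

    def store_data(self, path):
        self.df.to_csv(path, index=False)

if __name__ == "__main__":
    logic = Logic1()
    logic.load_data("data/data1.csv")
    logic.assign_column_names(["name", "age", "company"])
    logic.fill_missing_values()
    logic.drop_invalid_data()
    logic.sort_data(by="age")
    logic.group_data(by="age")
    logic.store_data("data/data1_grouped.csv")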

After removing Pandas code:
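
After removing Pandas, only the skeleton survives; the method names still document the steps, but every body is gone:

# logic1.py after removing Pandas: the class and its method names (the
# "table of contents" of the solution) remain, the bodies do not.
class Logic1:
    def load_data(self, path):
        ...

    def assign_column_names(self, names):
        ...

    def fill_missing_values(self):
        ...

    def drop_invalid_data(self):
        ...

    def sort_data(self, by):
        ...

    def group_data(self, by):
        ...

    def store_data(self, path):
        ...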

Code structure:

src/
├─ logic2.py
├─ logic1.py
├─ data/
│ ├─ data1.csv
│ ├─ data2.csv

Although the logic depends on Pandas methods, the engineer wrapped those methods inside a class, with a method for every required step. When executing the code, an object is instantiated from the class, and the methods are then called one after another elsewhere. This code style helps the engineer create an abstraction of the solution: the reviewer can figure out what the code does without knowing the details, much like knowing what a book is about just by checking its table of contents.

What is the value that the code produces?

Although the implementation is lost, the data scientist/engineer was able to keep the steps of the solution. However, the value is negligible.

Does the data scientist/engineer control the code?

The engineer needed to drop roughly half of the code, which means the engineer owns only about 50% of it.

Similar to case 0, the engineer embedded the logic inside the Pandas framework.

In the next step, we will isolate the Pandas framework and keep it away from the business logic.

Case 2

Before removing Pandas code:
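
A hedged sketch of this layout, with illustrative names: all Pandas calls live in utils_pandas.py, and the business logic imports only the wrappers.

# utils/utils_pandas.py: every Pandas call is isolated behind a plain
# function, so the business logic never imports Pandas directly.
import pandas as pd

def load_data(path):
    return pd.read_csv(path, header=None)

def assign_column_names(df, names):
    df.columns = names
    return df

def fill_missing_values(df, column):
    df[column] = df[column].fillna(df[column].median())
    return df

def drop_invalid_data(df):
    return df.dropna()

def sort_data(df, by):
    return df.sort_values(by=by)

def group_data(df, by):
    return df.groupby(by).size().reset_index(name="count")

def store_data(df, path):
    df.to_csv(path, index=False)

# logic1.py: the business logic knows only the wrapper functions.
from utils import utils_pandas as up

df = up.load_data("data/data1.csv")
df = up.assign_column_names(df, ["name", "age", "company"])
df = up.fill_missing_values(df, column="age")
df = up.drop_invalid_data(df)
df = up.sort_data(df, by="age")
df = up.group_data(df, by="age")
up.store_data(df, "data/data1_grouped.csv")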

After removing Pandas code:

The logic inside the files logic1.py and logic2.py stays the same.
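
What the customer receives might look like this: logic1.py and logic2.py untouched, and utils_pandas.py reduced to the signatures that the internal data team has to implement with their own framework.

# utils/utils_pandas.py as delivered: Pandas is gone, only the function
# signatures remain as a contract for the internal team.
def load_data(path): ...
def assign_column_names(df, names): ...
def fill_missing_values(df, column): ...
def drop_invalid_data(df): ...
def sort_data(df, by): ...
def group_data(df, by): ...
def store_data(df, path): ...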

Code structure:

src/
├─ logic1.py
├─ logic2.py
├─ utils/
│ ├─ __init__.py
│ ├─ utils_pandas.py
├─ data/
│ ├─ data1.csv
│ ├─ data2.csv

We created a new module, utils, where we keep all 3rd party frameworks isolated. The business logic calls those 3rd party frameworks through functions that wrap the original Pandas methods.

What is the value that the code produces?

Obviously, we now don't need to touch the code where the business logic is implemented. The code can be delivered without the utils part, where the Pandas framework is isolated. The customer knows the functions and their inputs that the internal data team needs to implement. However, the code is not executable unless the customer implements the utils code. The value is much higher than in the previous two cases, but it is still not optimal.

Does the data scientist/engineer own the code?

The engineer needed to exclude the utils part, which is about 25% of the code in the current case. The engineer's code ownership is therefore about 75%.

How can we optimize the value of the code and keep the code executable? That is what the following case shows.

Case 3

Before removing Pandas code:
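
The original snippets are not reproduced; a compressed, hedged sketch of the two central pieces, with illustrative names matching the code structure below, might look like this:

# logic/logic_interface.py: the interface fixes the signature that every
# framework-specific implementation must follow.
from abc import ABC, abstractmethod

class LogicInterface(ABC):
    @abstractmethod
    def run_logic1(self, data_model):
        """data1: load, rename, fill, drop invalid, sort, group by age, store."""

    @abstractmethod
    def run_logic2(self, data_model):
        """data2: load, rename, group by company, store."""

A class such as LogicWithPandas in logic_with_pandas.py inherits from LogicInterface and delegates every step to the wrappers in utils_pandas.py. The manager then instantiates whichever implementation the config file requests:

# logic/manager.py: maps the config key to a concrete implementation,
# so nothing outside this file knows which framework runs underneath.
from logic.logic_with_pandas import LogicWithPandas

class Manager:
    def __init__(self, framework_type):
        implementations = {"pandas": LogicWithPandas}
        self.logic = implementations[framework_type]()

    def run(self, data_model):
        self.logic.run_logic1(data_model)
        self.logic.run_logic2(data_model)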

After removing Pandas code:
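
This time only utils_pandas.py and logic_with_pandas.py are withheld; everything else ships untouched. A hedged sketch of the config, where framework_type is the only value that changes (the key names are illustrative):

# config.py: switching frameworks is a one-line change at delivery time.
config = {
    "framework_type": "new_framework",  # was "pandas"
    "data1_path": "data/data1.csv",
    "data2_path": "data/data2.csv",
}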

We need to extend manager.py and create a new class for the new framework:
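
A hedged sketch of that extension; LogicWithNewFramework is a stand-in name for the class built on the customer's confidential framework:

# logic/manager.py, extended: registering the new framework only adds an
# entry to the mapping; no existing line has to change (open for
# extension, closed for modification). The Pandas import and entry are
# dropped together with the rest of the Pandas code at delivery.
from logic.logic_with_pandas import LogicWithPandas
from logic.logic_with_new_framework import LogicWithNewFramework

class Manager:
    def __init__(self, framework_type):
        implementations = {
            "pandas": LogicWithPandas,
            "new_framework": LogicWithNewFramework,
        }
        self.logic = implementations[framework_type]()

    def run(self, data_model):
        self.logic.run_logic1(data_model)
        self.logic.run_logic2(data_model)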

Code structure:

src/
├─ utils/
│ ├─ utils_pandas.py
│ ├─ data_model.py
│ ├─ __init__.py
├─ data/
│ ├─ data1.csv
│ ├─ data2.csv
├─ logic/
│ ├─ logic_with_pandas.py
│ ├─ __init__.py
│ ├─ logic_with_new_framework.py
│ ├─ logic_interface.py
│ ├─ manager.py
├─ main.py
├─ config.py

Here, we bring the code to a new level. On the logic side, we create an interface that contains the signature of the solution; engineers need to follow that signature to integrate new frameworks. The solution can be implemented for every framework separately by inheriting from the logic interface. We implemented a manager class that is responsible for managing the 3rd party frameworks the logic needs to execute; the logic knows nothing about those frameworks. We implemented the utils the same way as shown in case 2. The engineers can control the pipeline via the config file without editing a single line of code, and main.py serves as the entry point to the pipeline.

Two fundamental changes that we introduced in the current case are:

  1. leveraging inheritance to develop logic1 on top of logic2, taking advantage of the similarities between the two to eliminate duplicated code
  2. using data models to encapsulate the input data in an object that we can pass to the manager

The second point has a huge advantage compared to passing loose arguments into the pipeline. When we need to pass many arguments, the risk of swapping values increases, and such loose arguments lead to significant bugs that engineers may spend hours or even days discovering.
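
A minimal sketch of such a data model, with hypothetical field names (the original data_model.py is not shown):

# utils/data_model.py: a dataclass bundles the pipeline inputs into one
# typed object instead of a long list of loose positional arguments.
from dataclasses import dataclass

@dataclass
class PipelineData:
    data1_path: str
    data2_path: str
    output_dir: str = "data"

Passing one such object to the manager means a new input is added in exactly one place, and named, typed fields make swapped values far easier to spot than loose positional arguments.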

What is the value that the code produces?

The data scientist/engineer could deliver the code without editing any part of it. Since we have a config file where the user can specify which framework to use, the Pandas framework can be activated or deactivated via the framework_type key. The manager handles the dependencies, and the business logic knows nothing about them. We can plug in any other framework easily; all we have to do is EXTEND the code on the manager's side.

Although the pipeline looks complex, it provides significant value in the mid and long term.

Does the data scientist/engineer own the code?

The engineer has full control of the code. Any new framework that the engineers want to integrate must follow the signature of the logic interface. Ownership is close to 100%.

Object-Oriented Programming (OOP), the Open-Closed Principle, and the Dependency Inversion Principle provide the leverage here.

Following the OOP style has many advantages and helps engineers write and maintain code in the long term.
The Open-Closed Principle, which states that code should be open for extension but closed for modification, adds the power of implementing new features without hurting the currently running software.
The Dependency Inversion Principle decouples the logic you control from the low-level details you don't control. Your logic stands alone, and you can replace every low-level 3rd party framework as easily as changing the value of a key at the configuration level.

Summary

In this article, I introduced a real-world example where solution delivery plays an important role in maintaining value on both the customer and the data team sides. I started by presenting a typical code style observed in most data science and data engineering solutions. From there, I moved step by step toward better and more resilient solutions by keeping the business logic untouched and isolated from 3rd party frameworks.

In the final solution, case 3, the engineer has complete control over the code. The engineer requires new frameworks to follow the solution's signature and enables the user to switch between frameworks without editing the code.

The example in this article reflects many of the challenges that we face every day in our projects. SOLID principles and Test Driven Development (TDD) are some of the best practices that we leverage to meet those challenges.

Giving our customers the power to change their requirements at any phase of the project, and having the ability to address those changes quickly, guarantees long-term success and cooperation.

If you want to join our team and enjoy solving data science, data engineering, and data analytics problems, check out our open positions at KI performance.

If you have any questions, let us talk. We are happy to answer all your questions.

At KI group, we are looking for entrepreneurs, solvers, and creators who want to make a difference by building sustainable, user-, customer-, and planet-driven business models and solutions in a constantly evolving world. If you're interested in working in a fast-paced, diverse environment on a variety of projects, companies, products, and technologies, be sure to get in touch with us; we are looking forward to meeting you!

Maher Deeb
Senior Data Engineer/Chapter Lead Data Engineering @ KI performance