Ashutosh Joshi
6 min read · Sep 27, 2023

SOLID Principles: A practical guide to designing an ETL pipeline

Introduction

In today's data-driven landscape, design and architecture play a critical role in determining the efficiency and maintainability of data processing systems. Following the SOLID principles when designing our applications provides a robust framework for creating scalable, adaptable, and maintainable data processing solutions.

In this article, we will explore how each SOLID principle can be practically applied to design a simple ETL pipeline. By understanding and implementing these principles, developers can create ETL pipelines that are not only efficient at processing data but also flexible and easy to maintain.

We will understand these principles by creating a simple ETL pipeline with three major stages: data extraction, data transformation, and data loading.

For each stage, there can be multiple sources for data extraction, multiple tasks for transformation, and multiple sinks for data loading.

Source -> Transform -> Sink

1. Single Responsibility Principle (SRP)

The SRP emphasises that a class or module should have only one reason to change. In the context of ETL pipeline design, this principle suggests that each component of the pipeline should have a clear and singular responsibility.

Consider an example where a single class called KafkaSource does everything: reading from the source, applying the transformation, and loading data into the sink.

class KafkaSource {
    public void readStream() {
        System.out.println("reading from kafka");
    }

    public void applyTransformation() {
        System.out.println("applying transformation");
    }

    public void writeStream() {
        System.out.println("loading to hdfs");
    }
}

Our KafkaSource class works well in this example: we can read from Kafka, apply a transformation, and load data into HDFS. However, it violates SRP, because a source should have only one responsibility: reading data from the source.

Following SRP, we can fix this by creating distinct classes for each stage. This separation ensures that if changes are needed in the transformation process, they won't affect the extraction or loading stages.

class KafkaSource {
    public void readStream() {
        System.out.println("reading from kafka");
    }
}

class Transformation {
    public void applyTransformation() {
        System.out.println("applying transformation");
    }
}

class HDFSSink {
    public void writeStream() {
        System.out.println("loading to hdfs");
    }
}

By adhering to SRP, the ETL pipeline’s codebase becomes modular, and changes to one component don’t impact others. This principle enhances collaboration among developers working on different parts of the pipeline and allows for easier updates as data processing requirements evolve.
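With the responsibilities separated, a pipeline run is just a composition of the three classes. Here is a minimal sketch of that wiring (the EtlPipeline class is our own illustration, using the classes above):

public class EtlPipeline {
    public static void main(String[] args) {
        KafkaSource source = new KafkaSource();
        Transformation transformation = new Transformation();
        HDFSSink sink = new HDFSSink();

        // Each stage can change independently without affecting the others.
        source.readStream();
        transformation.applyTransformation();
        sink.writeStream();
    }
}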

2. Open/Closed Principle

The Open/Closed Principle states that software entities (classes, modules, functions) should be open for extension but closed for modification. We should be able to write new modules with new features that work with old code without touching the old code.

In ETL pipeline design, this simply means we should not modify the old implementation; instead, we add features by extending it with new classes, which avoids introducing bugs into code that already works.

For example, consider our ETL pipeline's Sink stage. We initially designed it to handle a simple data sink. To make it open for extension, we can introduce an abstract class, `Sink`, and concrete implementations like `RedisSink` and `HBaseSink`. If, in the future, we need to add more sinks, we can simply extend Sink without modifying existing code, allowing the pipeline to evolve without disrupting its core functionality.

Class Diagram: Open/Closed Principle Example
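In code, that design might look like the following sketch (the createConnection/writeStream contract is taken from the LSP section below):

abstract class Sink {
    public abstract void createConnection();
    public abstract void writeStream();
}

// New sinks extend Sink; existing code is never modified.
class RedisSink extends Sink {
    @Override
    public void createConnection() {
        System.out.println("redis connection created");
    }

    @Override
    public void writeStream() {
        System.out.println("inserting events into redis");
    }
}

class HBaseSink extends Sink {
    @Override
    public void createConnection() {
        System.out.println("hbase connection created");
    }

    @Override
    public void writeStream() {
        System.out.println("inserting events into hbase");
    }
}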

3. Liskov Substitution Principle (LSP)

This is an extension of the Open/Closed Principle and a bit difficult to understand initially. Let me make it simple and give you an in-depth understanding of it.

It states that objects of a superclass shall be replaceable with objects of its subclasses without breaking the application. A more generic definition is that derived types must be completely substitutable for their base types.

For simplicity, remember these implementation guidelines of LSP; we will prove all of these points in the example below.

1. Clients should not know which specific subtype they are calling.

2. The subtype can throw no new exceptions.

3. New derived classes extend without replacing the functionality of old classes.

Okay, enough theory. Let's understand this with an example.

In the Open/Closed Principle example above, we used Sink (the base type) to implement different types of sinks (derived types) like RedisSink and HBaseSink. Sink defines a contract: every derived type must create a connection and then write the stream.

Sink objSink = new RedisSink();
// or equivalently:
Sink objSink = new HBaseSink();

Here, we assign both RedisSink and HBaseSink objects to a Sink reference, and, if you look closely, the derived types RedisSink and HBaseSink completely substitute the base type Sink. So, per LSP, we have shown that the derived types are completely substitutable for their base class.

Now call the createConnection method:

objSink.createConnection();

Notice that here, too, the client creates the sink connection through the base-class reference, without knowing the specific subtype. Still, this design is error prone even though the basic principle is implemented. Let's understand why.

Suppose we want to add a new sink type called `ConsoleSink` to debug events by writing them to the console. `ConsoleSink` can't meaningfully provide the createConnection method, because no connection is required to write a stream to the console. One common workaround is to make ConsoleSink throw an `UnsupportedOperationException` in the method it cannot implement.

class ConsoleSink extends Sink {

    @Override
    public void createConnection() {
        throw new UnsupportedOperationException("connection creation not supported by ConsoleSink!!");
    }

    @Override
    public void writeStream() {
        System.out.println("writing events into console");
    }
}

Now, if we call this from our client code:

objSink.createConnection();

Unsurprisingly, the application crashes with the error:

connection creation not supported by ConsoleSink!!

However, not supporting the createConnection() method in ConsoleSink violates the method contract, and it clearly violates the LSP guideline stating that "the subtype can throw no new exceptions".

Let's refactor this code to make it compliant with LSP:

Step 1 — Because not all sinks support creating a connection, we move the createConnection method out of the Sink class and into a new abstract subclass, SinkWithConnection.

public abstract class Sink {
    public abstract void writeStream();
}

public abstract class SinkWithConnection extends Sink {
    public abstract void createConnection();
}

Step 2 — Implement ConsoleSink by extending the Sink class, which now has only one method, writeStream().

public class ConsoleSink extends Sink {
    @Override
    public void writeStream() {
        System.out.println("writing events into console");
    }
}

Step 3 — Both RedisSink and HBaseSink allow creating a connection, so both are now made subclasses of the new SinkWithConnection, as shown below.

class RedisSink extends SinkWithConnection {
    @Override
    public void createConnection() {
        System.out.println("redis connection created");
    }

    @Override
    public void writeStream() {
        System.out.println("inserting events into redis");
    }
}
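HBaseSink follows the same shape (a sketch; the article shows only RedisSink at this step):

class HBaseSink extends SinkWithConnection {
    @Override
    public void createConnection() {
        System.out.println("hbase connection created");
    }

    @Override
    public void writeStream() {
        System.out.println("inserting events into hbase");
    }
}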

Step 4 — Create a service that accepts only subtypes of SinkWithConnection.

public class SinkConnectionService {

    private SinkWithConnection objSinkWithConnection;

    public SinkConnectionService(SinkWithConnection objSink) {
        this.objSinkWithConnection = objSink;
    }

    public void createSinkConnection() {
        objSinkWithConnection.createConnection();
    }
}

Now, from the client code, you can easily substitute the superclass SinkWithConnection with either of its subclasses, RedisSink or HBaseSink.
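For example, a minimal client sketch:

SinkConnectionService service = new SinkConnectionService(new RedisSink());
service.createSinkConnection();

// Substituting the other subclass works without any client changes:
SinkConnectionService hbaseService = new SinkConnectionService(new HBaseSink());
hbaseService.createSinkConnection();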

4. Interface Segregation Principle (ISP)

This principle simply says to break a large interface up into multiple smaller interfaces. In ETL pipeline design, this translates to creating specialized interfaces that cater to the needs of specific components.

Suppose we have interfaces like DataExtractor, DataTransformer, and DataLoader. Rather than combining them into a single monolithic interface, we segregate them to match the specific requirements of each component. This prevents classes from being burdened with unnecessary methods and promotes a more cohesive design.

Class Diagram: ISP Example
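In code, the segregated interfaces might look like this sketch (the single-method signatures and the KafkaExtractor class are our assumptions):

interface DataExtractor {
    void extract();
}

interface DataTransformer {
    void transform();
}

interface DataLoader {
    void load();
}

// A component implements only the interface it actually needs,
// instead of being burdened with unrelated methods.
class KafkaExtractor implements DataExtractor {
    @Override
    public void extract() {
        System.out.println("extracting from kafka");
    }
}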

5. Dependency Inversion Principle (DIP)

It encourages high-level modules to depend on abstractions, not concrete implementations. In ETL pipelines, this means that the core pipeline components should rely on interfaces rather than specific implementations, promoting flexibility and testability.

It allows us to remove hard-coded dependencies and make our application loosely coupled, extensible, and maintainable. Dependency injection can be achieved by injecting dependencies into the client and moving dependency resolution from compile time to runtime.
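A minimal sketch of constructor injection for the pipeline, reusing the ISP interfaces above (the EtlJob class is our own illustration):

public class EtlJob {
    // The high-level module depends on abstractions, not concrete implementations.
    private final DataExtractor extractor;
    private final DataTransformer transformer;
    private final DataLoader loader;

    // Dependencies are injected at runtime rather than hard-coded.
    public EtlJob(DataExtractor extractor, DataTransformer transformer, DataLoader loader) {
        this.extractor = extractor;
        this.transformer = transformer;
        this.loader = loader;
    }

    public void run() {
        extractor.extract();
        transformer.transform();
        loader.load();
    }
}

Because EtlJob sees only interfaces, each dependency can be swapped for a stub in tests without touching the pipeline code.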

By applying the SOLID principles as demonstrated, you can make your application not only efficient and modular but also maintainable and extensible. These principles form the foundation for designing robust data processing systems that can evolve with changing requirements.