Plug-in Architecture

and the story of the data pipeline framework

Omar Elgabry
OmarElgabry's Blog
9 min read · May 1, 2019


Plug-in Architecture

This is one of the things we probably use and interact with on a daily basis, yet we seldom realize it exists.

Not only does it exist in product-based applications such as Eclipse (the IDE) or any browser that can be customized by adding plugins and extensions, but it also exists in business-based applications where business rules and data processing logic vary according to, say, a country, as in insurance claim and tax applications.

The underlying idea is simple: being able to plug features into an existing component without that component having to know about the implementation details of these plugged-in features.

It sounds like I’ve heard the same sentence before.

Polymorphism (OOP)? Protected Variations (Design Principle)? Strategy Pattern (Design Patterns)?

And yes, that’s right! They all embrace the same concept.

Description

The plug-in architecture consists of two components: a core system and plug-in modules.

The key design goal here is to allow adding features as plugins to the core application, providing extensibility, flexibility, and isolation of application features and custom processing logic.

The specific rules and processing are separate from the core system. At any given point, we can add, remove, and change existing plugins with little or no effect on the rest of the core system or other plug-in modules.

Core system

At a high level, it defines how the system operates and the basic business logic. There is no specific implementation, no customization. It is abstracted.

A simple example: the generic workflow, such as how data flows inside the application, is defined by the core. But the steps involved in that workflow are up to the plugin. And so, all extending plugins follow that generic flow while providing their own customized implementations.
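
As a minimal sketch of that idea (the names runWorkflow, collect, process, and publish are made up for illustration, not part of any particular framework), the core fixes the order of the steps while each plugin supplies its own implementation of every step:

// The core defines the generic flow and its order;
// every step's implementation comes from the plugin.
function runWorkflow(plugin, input) {
  const collected = plugin.collect(input);      // step 1: plugin-specific
  const processed = plugin.process(collected);  // step 2: plugin-specific
  plugin.publish(processed);                    // step 3: plugin-specific
}

// Every plugin is run through the same flow:
// runWorkflow(csvPlugin, rawCsvFile);
// runWorkflow(logsPlugin, rawLogEntries);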

Digging a bit deeper, it also handles special cases, applies special rules, and performs complex conditional processing. These are the things that need to be enforced regardless of the extending plugin.

In addition, it contains the common code used (or required to be used) by multiple plugins, as a way to get rid of duplicate and boilerplate code and to have one single source of truth.

For example, if two plugins both log transactions and failures, the core system should provide such a logger as part of it. Not to mention things like security, versioning, UI components, database access, caching, etc.
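
As a rough sketch of that idea (the names below are illustrative only), the core could own a shared logger and hand it to every plugin, instead of each plugin rolling its own:

// The core owns the shared services...
const coreServices = {
  logger: {
    info: (msg) => console.log('[core] ' + new Date().toISOString() + ' ' + msg),
    error: (msg) => console.error('[core] ' + new Date().toISOString() + ' ' + msg),
  },
};

// ...and passes them to each plugin on initialization, so transactions
// and failures are logged through one single logger.
function initPlugin(plugin) {
  plugin.init(coreServices);
}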

Plug-ins

The plug-ins are stand-alone, independent components that contain specialized processing, additional features, and custom code that is meant to enhance or extend the core system to produce additional capabilities.

Generally, plug-in modules should be independent of other plug-in modules. Though some plug-ins require talking to, or assume the presence of, other plug-ins. Either way, it is important to keep the communication and the dependencies between plug-ins as minimal as possible.

Core←→Plug-ins

The core system needs to know about (1) the extending plug-in modules and (2) how to get to them.

The core system declares extension points that plugins can hook into. These extension points, these hooks, often represent the core system life cycle.

And so, each plugin registers itself with the core, passing some information such as its name, communication protocol, input/output data handlers, and data format, and hooks into these extension points.

There should be a well-defined interface between the core and the plugins.

How the core system connects to these plugins depends entirely on the type of application you are building (a small product or a large business application) and your specific needs (e.g., a single deployment or a distributed deployment).

A quick glimpse at the different ways of connecting any two components.

Configurations

By now you should have already noticed that these plugins need to announce themselves to the core system, tap into the extension points, and pass some information.

That brings up the idea of configurations: the glue, the point of contract, by which we attach a plugin to the core system, and where everything mentioned above (extension points, communication protocol, etc.) is defined.

// This plugin hooks into the core system life cycle
// and saves the daily activities events emitted
// by core into a database.
core.registerPlugin({
  name: 'track-my-activities',
  port: 8081,
  hooks: {
    wakeup: function (time) {
      saveTime("Woke up at: " + time);
    },
    work: function (time) {
      saveTime("Started work at: " + time);
    },
    exercise: function (time) {
      saveTime("Exercising at: " + time);
    }
  }
});
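
On the other side of that contract, here is a rough sketch of what the core might do with such a configuration (illustrative only, not an actual implementation): keep the registered plugins in a list and emit each life-cycle event to every plugin that hooked into it.

// Illustrative core side: store registered plugins and dispatch
// life-cycle events to the matching hooks.
const core = {
  plugins: [],

  registerPlugin(config) {
    this.plugins.push(config);
  },

  // e.g. core.emit('wakeup', '7:00') calls every registered 'wakeup' hook.
  emit(event, ...args) {
    for (const plugin of this.plugins) {
      const hook = plugin.hooks && plugin.hooks[event];
      if (typeof hook === 'function') hook(...args);
    }
  },
};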

The configurations not only give us agility but also allow us to visualize how each plugin works and, perhaps, the flow of the data.

If gluing a plugin to an existing core system is a pain due to an incompatible interface, it is common to create an adapter between the plug-in and the core system. That way, the core doesn’t need specialized code for each incompatible plug-in.
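
A sketch of such an adapter (the legacy interface shown here is hypothetical): it wraps a plugin that only exposes a generic onEvent(name, data) method into the hooks shape the core expects.

// The adapter translates the core's hooks into the legacy plugin's
// own interface, so the core needs no special-case code.
function adaptLegacyPlugin(legacyPlugin) {
  return {
    name: 'legacy-activities',
    hooks: {
      wakeup: (time) => legacyPlugin.onEvent('wakeup', time),
      work: (time) => legacyPlugin.onEvent('work', time),
      exercise: (time) => legacyPlugin.onEvent('exercise', time),
    },
  };
}

// core.registerPlugin(adaptLegacyPlugin(someLegacyPlugin));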

Where can the configurations be stored?

The configurations can live in the code itself, be passed to a CLI tool, or be stored in the database. They can be written in various languages like YAML, TOML, JS, JSON, XML, etc.

Whatever language you decide to use, make sure you’re able to do things like comments, conditions, validations, loops, or whatever is needed. Clearly, that’s not possible with JSON.

In addition, a small tool can be built to validate these configurations.
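
For instance, a tiny validator could check the shape of a configuration before the plugin is registered (a sketch; the expected fields are simply the ones used in the example above):

// Returns a list of problems; an empty list means the config looks valid.
function validatePluginConfig(config) {
  const errors = [];
  if (typeof config.name !== 'string' || config.name.length === 0) {
    errors.push('name must be a non-empty string');
  }
  if (config.port !== undefined && !Number.isInteger(config.port)) {
    errors.push('port must be an integer');
  }
  if (typeof config.hooks !== 'object' || config.hooks === null) {
    errors.push('hooks must map extension points to functions');
  }
  return errors;
}

// const errors = validatePluginConfig(pluginConfig);
// if (errors.length > 0) throw new Error(errors.join('; '));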

One Vs Multiple instances

We’ve mentioned two common use cases. The product-based and business-based applications.

Plug-in Architecture — One Vs Multiple instances

Adding plugins and extensions to our browsers is something we’ve gone through several times. That’s one core system (browser) where plugins can be attached to it.

On the other hand, especially on large business applications, we can have multiple application instances, where each instance extends the core and adds one or more plugins to it. In the example of the browser, this means having multiple browsers, each with its own set of features.

So, we’re essentially working with multiple applications extending the core system, each working on its own, totally independent.

Other patterns

This pattern, not surprisingly, solves a particular problem. And so it can shape the entire architecture, be part of another architectural pattern, or be used alongside one.

For example, you might have a nice layered architecture with the plug-in architecture embedded inside it.

Moreover, you can have a component-based architecture, where the application is composed of components (customer, order, payment, etc), and these components are the main parts of the core system.

Analysis

The nature of this architecture comes with many advantages.

Since these plugins are independent, we get the agility to quickly change, remove, and add plugins.

It reduces the headache of having a bunch of services, of plugins, talking to each other and having to handle the resulting failures. The behaviour of the application becomes somewhat predictable.

Depending on how the pattern is implemented, each plugin can be deployed, tested, and scaled separately.

However, there is a major pitfall: the core system itself. Changing it might break or alter the plugins’ behaviour. And so, it requires thoughtful design from the start.

Everything we’ve mentioned, from defining the possible extension points, to connecting the core to the plugins, versioning, and the business rules enforced by the core, contributes to the complexity involved in implementing this pattern.

Get it wrong, and you’ll end up with a complex core system full of if-else conditions, where plugin independence is no longer a characteristic, and where changing one line of code requires an arsenal of analysts, developers, and testers.

Nevertheless, there is a story to be told:

A data pipeline framework that enables you to collect & store, process & analyze, and consume the result data using nothing but configurations.

Data Pipelines

They are everywhere. Almost everyone has touched on the idea. Even though the term sounds ambiguous and doesn’t imply a specific use case, the underlying concept is simple.

You have a bunch of data knocking at your door. Regardless of where it comes from or how it was generated, and whether it is real-time or batch data, you want to:

  • Store → The data is being collected and stored.
  • Process → … is converted, parsed, cleaned and analyzed.
  • Consume → … can be queried, visualized, and alerts can be sent out.

That might sound OK for a single use case.

But what if we need to work with different types of data coming from different sources? Transactional data (placing an order, making a payment, etc.), files/objects, and real-time sensor data are all different.

And because they are different, each has to be processed differently and independently for different purposes.

Analyzing CSV files is not the same as analyzing logs coming from the web application. Perhaps, with logs, we want to detect failures or frauds, while with the CSV files we want to extract some information and clean them up.

That sounds like having multiple pipelines, multiple applications.

And yes, that’s true.

We can also take advantage of the “Plug-in Architecture”, encapsulating the three pipeline stages: Store, Process, Consume.

These three stages are the core system, while data sources, storage, processors, and consumers (all these services and tools) can be plugged into a pipeline.

And what it means is that each pipeline is configurable. We define where the data is coming from, supply the code for processing it, and define who’s going to consume the end result, … and BOOM! A pipeline is up and running.
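
As a hedged sketch of what such a configuration could look like (the registerPipeline call and the helpers normalize, detectFraud, and notifySlack are hypothetical, not the framework’s actual interface):

// The three stages are the core's extension points; the pipeline only
// declares what plugs into each of them.
core.registerPipeline({
  name: 'transactions',
  store: {
    source: 'http',    // data arrives through an HTTP endpoint
    storage: 'nosql',  // and is stored in a NoSQL database
  },
  process: [
    (record) => normalize(record),    // classify and normalize
    (record) => detectFraud(record),  // flag potential frauds
  ],
  consume: [
    (result) => notifySlack(result),  // alert Slack when a fraud is detected
  ],
});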

Requirements

To summarize the requirements:

  • The framework has to be as agnostic as possible in order to support the widest range of possible use cases.
  • Pipelines are easily deployed, scaled, and tested independently.
  • Infrastructure maintenance is minimal, with no devops expertise required.
  • Support different types of data (JSON, CSV).
  • … data sources (HTTP endpoint, AWS CloudWatch, sensor devices)
  • … data storage (in-memory, NoSQL, relational, file/object storage)
  • … data processing (Lambda functions, Spark, Hive, Kinesis Analytics)
  • … consumers (Lambda function, S3, emails, Slack, visualization tools).

For the sake of simplicity, and most of the time, you’ll only need a handful of these services. You’ll only have a handful of use cases. Later, you can expand and add more services as needed and support more cases as well.

Workflow

We’ll support two use cases: normal transactional data, and files being uploaded. Each goes through a different pipeline, a different application.

[1] Collect & Store

  1. HTTP endpoint is exposed through our API backend. Users can send some data or upload CSV files.
  2. Transactional data are then stored in a NoSQL database, while files are stored in AWS S3.

Once data is stored, the “Process” stage kicks off. This is handled by the core system. Internally, it has a serial queue: it inserts a task, and that task triggers the “Process” stage. This applies to all the pipelines extending the core system.
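
A minimal sketch of that serial queue (assumed names): tasks are chained so they run one at a time, in order, and each stored record enqueues a task that triggers the owning pipeline’s “Process” stage.

// Tasks are chained onto the previous one, so only one runs at a time.
class SerialQueue {
  constructor() {
    this.tail = Promise.resolve();
  }

  enqueue(task) {
    this.tail = this.tail.then(task).catch((err) => console.error(err));
    return this.tail;
  }
}

const processQueue = new SerialQueue();

// Called by the core once the "Collect & Store" stage has finished.
function onDataStored(pipeline, record) {
  processQueue.enqueue(() =>
    // Run each processing step declared in the pipeline's configuration.
    pipeline.process.reduce((acc, step) => step(acc), record)
  );
}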

[2] Process

  • For transactional data. The first step in processing is classification and normalization. The second step is to identify potential frauds.
  • For CSV files. The steps are: extract, clean, and then store the result back to a relational database.

Once done, all the consumers will be notified. Again, the core system takes care of the navigation from one state to the next.

[3] Consume

  • For transactional data. We would like to send an alert to Slack giving a heads up that a fraud has been detected.
  • For CSV files. The web UI application will consume the result data stored in the relational database. We also want to send an email.

Final Thoughts

Of course, there are many ways of doing it.

For example, we could have sensor devices sending raw data to Kinesis, which stores the data for a while and passes it to a Lambda function for processing.

The core system itself can use the HTTP request-response instead of a queue. Or, maybe, it can be as simple as a method call.


Thank you for reading!

Feel free to reach out on LinkedIn or Medium.
