Software Architectural Patterns in Data Engineering
The design philosophy behind awesome big data technologies
What makes big data technologies highly scalable, available, fault-tolerant, performant, and reliable? I believe the answer lies in the architectures of these technologies. Without a solid design philosophy, it would be tough to develop technologies that handle data so effortlessly at such a scale. So, let’s have a look at the architectures of the popular data engineering technologies. At Expedia Group™, business decisions are driven by data. As a result, data engineering is crucial, and it is carried out using some of the technologies listed below.
- Apache Hadoop
- Apache Spark
- Apache Hive
- Apache Kafka
- Apache Airflow
- Serverless (Functions as a Service, e.g. AWS Lambda)
- Tableau or Grafana or other visualization technologies
And these data technologies are common across the industry. Big data processing is carried out in the following three modes.
- Batch, where sizeable chunks of data are processed at a set frequency.
- Real-time, where data is processed as soon as it arrives in the system.
- Hybrid, a combination of both approaches.
Some data products are expected to produce results once a day; those can be built with the batch approach. Others may require results in real time or near real time, for example finding the top 10 hashtags on Twitter every 10 minutes.
A study of the big data technologies listed above shows that they are developed around these data processing approaches and around the primary data access patterns, which are SQL, APIs, and visualisations. It is therefore fair to say that these data technologies are guided by a design philosophy. This design philosophy has resulted in several recurring architectural patterns, which are summarised below.
- Layered architecture
- Actor-reactor architecture
- Micro-kernel or plugin architecture
- Micro-service architecture
- Event-driven architecture
Let’s study each one of these and try to find the answer to our question.
Layered architecture
In the layered architectural pattern, the technology is divided into multiple layers, where each layer performs a specific function in the overall software, and communication flows from one layer to the next. There are usually more than two layers in this style, as shown in the diagram below.
As shown in the diagram above, users interact with the user interface layer; instructions from the user interface layer move to the application layer, where business or domain rules are applied; the application layer then talks to the data storage layer to read, write, update, or delete data; and the communication flows backward to respond to the user. The MVC (Model-View-Controller) design pattern is based on a layered approach.
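As a minimal sketch of the three-layer flow described above (all class and method names here are invented for illustration), each layer talks only to the layer directly below it:

```python
class DataLayer:                      # Layer 3: storage
    def __init__(self):
        self._rows = {}

    def put(self, key, value):
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)


class ApplicationLayer:               # Layer 2: business/domain rules
    def __init__(self, data: DataLayer):
        self._data = data

    def register_user(self, name):
        if not name:                  # a domain rule lives in this layer
            raise ValueError("name required")
        self._data.put(name.lower(), {"name": name})

    def lookup_user(self, name):
        return self._data.get(name.lower())


class UserInterfaceLayer:             # Layer 1: user-facing
    def __init__(self, app: ApplicationLayer):
        self._app = app

    def handle(self, command, name):
        if command == "register":
            self._app.register_user(name)
            return f"registered {name}"
        return str(self._app.lookup_user(name))


# The user only ever touches Layer 1; the call travels down and back up.
ui = UserInterfaceLayer(ApplicationLayer(DataLayer()))
print(ui.handle("register", "Ada"))   # registered Ada
print(ui.handle("lookup", "Ada"))
```

Note how swapping `DataLayer` for a real database client would not touch the other two layers, which is the separation of concerns this pattern buys.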
The advantages of this pattern are its simplicity and the separation of concerns among layers, but tight coupling between layers, limited scalability, and a growing number of layers are the threats.
Example: Apache Hive
The user submits a Hive query through the query layer (the user interface, Layer 1); the query then lands in the processing layer (Layer 2), where MapReduce jobs are created; the MapReduce layer contacts the data layer (e.g. HDFS, Layer 3) to get the data; computations are carried out; and the results are passed back to the user.
It follows the layered architecture as well, with layers for the user interface, data processing, and data connectors.
Actor-reactor architecture
The actor-reactor architectural pattern works like a machine that has a controller unit (the actor) and execution units (the reactors). The controller unit is the decision maker, and the execution units are the decision executors. There is usually one controller unit, while there can be many executors or reactors, as shown in the diagram below. Simply put, the actor is the commander and the reactors are the command executors.
In the actor-reactor architecture, the actor delegates the tasks. The advantages are scalability and performance; however, the design introduces a Single Point of Failure (SPoF): if the actor fails, the entire process fails.
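The delegation idea can be sketched in a few lines (a toy job, with the partitioning scheme invented for illustration): one actor splits the work and many reactors execute the pieces in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def reactor(chunk):
    # Execution unit: does the actual work on its slice of the data.
    return sum(x * x for x in chunk)


def actor(data, n_workers=4):
    # Decision maker: partitions the job, delegates the pieces to the
    # reactors, and combines their results. Note it is also the single
    # point of failure if nothing stands in for it.
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(reactor, chunks))


print(actor(list(range(10))))  # 285
```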
Example: Apache Spark
Spark is developed using the actor-reactor architecture. The driver is the actor, and the worker nodes are the reactors. The worker nodes are pulled in to complete the job the user submits to the driver.
HDFS is also an example of the actor-reactor architecture. The name-node is the actor and the data-nodes are the reactors. One interesting point to note here is that, in a high-availability setup, a standby name-node takes over if the active name-node fails. Hence, HDFS has a defence mechanism against the single point of failure; the standby name-node makes it fault-tolerant.
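A toy failover sketch in the same spirit (both actor functions here are hypothetical): when the primary actor fails, a standby takes over, so the single point of failure is mitigated.

```python
def primary_actor(data):
    # Simulate a crashed primary actor.
    raise RuntimeError("primary actor crashed")


def standby_actor(data):
    # The standby performs the same role as the primary.
    return sum(data)


def submit(data):
    # Try each actor in order; fall through to the standby on failure.
    for candidate in (primary_actor, standby_actor):
        try:
            return candidate(data)
        except RuntimeError:
            continue
    raise RuntimeError("all actors failed")


print(submit([1, 2, 3]))  # 6
```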
Micro-kernel or plugin architecture
Micro-kernel or plugin architecture is an architectural pattern in which the core functionality is separated from the add-ons or plugins. This pattern can be compared to the pizza preparation process: the pizza base (or dough) is usually prepared separately and kept ready, as it is the core element of the pizza, while the additional layers, toppings, and so on are added based on customer requirements. In the micro-kernel or plugin architecture, the core (or micro-kernel) is the main component, and plugins are added by users to implement the required functionality. One important point to note is that the micro-kernel should be small, i.e. minimalist, shipped with the appropriate fundamental functionality and the ability to collaborate and coordinate with any of the plugins.
The plugin architecture simplifies software development, enhancement, and shipment, as the core and the plugins can be developed and shipped independently. The issue with this style is the excessive use of plugins: if a large number of plugins are used, the software may become resource intensive.
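A minimal sketch of the idea (the kernel and both plugins are invented for illustration): the kernel only knows how to register and dispatch to plugins, while all real functionality lives in the plugins themselves.

```python
class Kernel:
    """Minimalist core: registration and dispatch, nothing else."""

    def __init__(self):
        self._plugins = {}

    def register(self, name, plugin):
        self._plugins[name] = plugin

    def run(self, name, payload):
        if name not in self._plugins:
            raise KeyError(f"no plugin registered for {name!r}")
        return self._plugins[name](payload)


kernel = Kernel()

# Plugins can be developed and shipped independently of the core:
kernel.register("sql", lambda query: f"executed query: {query}")
kernel.register("streaming", lambda batch: [record.upper() for record in batch])

print(kernel.run("sql", "SELECT 1"))
print(kernel.run("streaming", ["a", "b"]))  # ['A', 'B']
```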
Example: Apache Spark
Apache Spark is a good example of the micro-kernel architecture in data engineering. Spark Core is the micro-kernel, and Spark SQL, Spark Streaming, GraphX, and MLlib are the plugins that leverage it.
Docker, or containerisation in general, can also be cited as another example of the micro-kernel architecture. Developers start creating an application from a base image and keep adding additional requirements until they reach the desired state for the final product.
Micro-service architecture
The micro-service architectural pattern is another style in which a technology, product, or application is developed as a set of services that are small, loosely coupled, independent, and maintained separately. A product can be divided into micro-services along various lines, for example capabilities or functions, domain or sub-domain, and so on.
At a very high level, on an e-commerce website, one service could manage the search page, a second the product page, a third the checkout activities, and a fourth user authentication. Each of these high-level services could be further divided into micro-services responsible for a subset of activities. Creating micro-services has multiple advantages, for example:
- Smaller code base
- Easy to develop and manage
- Lower resource requirements
The micro-services mostly interact through Application Programming Interfaces (APIs), typically HTTP(S) and REST based. The problems with micro-services are a) communication between the micro-services and b) managing the micro-services as their number grows. The architectural pattern may appear as follows.
- At Expedia Group™, the payment service is independent of the authentication service. The chat service is independent of both of the others.
- At Amazon, the product recommendation service should be independent of the service which checks delivery at a zip-code. And these are possibly two separate (micro)services.
- At Netflix, the video analytics service is independent of the video streaming service. And both are important for business of Netflix.
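As a minimal sketch of service independence (the service name, route, and data below are made up), the snippet runs a tiny "product" micro-service with Python's standard `http.server`, and a second party reaches it only through its HTTP API rather than by importing its code:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# The product service owns its own data; no other service touches it directly.
PRODUCTS = {"1": {"name": "suitcase", "price": 49.0}}


class ProductHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /products/1
        key = self.path.rsplit("/", 1)[-1]
        product = PRODUCTS.get(key)
        body = json.dumps(product if product else {"error": "not found"}).encode()
        self.send_response(200 if product else 404)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


# Start the service on an ephemeral port in a background thread.
server = ThreadingHTTPServer(("127.0.0.1", 0), ProductHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# A "checkout" service would call the product service over HTTP like this:
with urllib.request.urlopen(f"http://127.0.0.1:{port}/products/1") as resp:
    product = json.loads(resp.read())
print(product["name"])  # suitcase
server.shutdown()
```

Because the only contract between the two sides is the HTTP API, either service can be rewritten, redeployed, or scaled without touching the other.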
At Expedia Group™, data engineering teams offer insights to business and other users. The insights focus on travellers, inventory, business transactions, etc. Data is brought into the system using micro-services and then processed using micro-services to produce those insights. Function as a Service (FaaS), discussed in the next section, is also used to develop micro-services.
Event-driven architecture
As the name suggests, in this architectural pattern the event, action, or message is the driver of the design. The main constituents of the pattern are the following.
- Emitter, the one that triggers the event, performs the action, or transmits the message. The emitter is also called the producer or publisher.
- Channel, the message carrier or message store. The channel is also termed the server, broker, or router. The channel may also have the capability to retain events.
- Consumer, the message reader or processor. The consumer goes by other names as well; sink and subscriber are common.
When shoppers shop on an e-commerce website or mobile app, they perform activities such as searching for a product, viewing it, and buying it. These activities are the shoppers’ actions, or events, and they are passed to the e-commerce company over APIs; the company is thus the consumer of the events. The simplified architecture looks as follows.
The advantages of this architectural pattern include a) decoupling, b) scalability, c) agility, d) optimised costs, as everything is on demand, e) availability, and f) performance. However, the loose coupling and asynchronous behaviour can make this pattern inconsistent and difficult to test and maintain. The pattern is best suited to cases where emitters don’t wait for the consumers: as soon as the emitters send a message, they trust that it will reach the consumer at some point in time. Today, there are many technologies that follow this paradigm.
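The emitter, channel, and consumer roles can be sketched in-process with a standard-library queue standing in for the channel (a real system would use a broker such as Apache Kafka):

```python
import queue
import threading

channel = queue.Queue()   # the channel: carries and briefly retains events
received = []


def consumer():
    # The consumer processes events as they arrive.
    while True:
        event = channel.get()   # blocks until an event is available
        if event is None:       # sentinel value: stop consuming
            break
        received.append(f"processed {event}")


worker = threading.Thread(target=consumer)
worker.start()

# The emitter fires events and moves on; it never waits for the consumer.
for event in ["search", "view", "buy"]:
    channel.put(event)

channel.put(None)   # tell the consumer to shut down
worker.join()
print(received)
```

The emitter finishes its loop regardless of how far behind the consumer is, which is exactly the decoupling the pattern promises.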
One family of such technologies is queue based: publishers send messages to a queue, which are then read by the consumers. Once a message is read by a consumer and the acknowledgement is performed with the queue, the message is deleted from the queue.
Example: Apache Kafka
Apache Kafka is a well-known project in the Apache suite. Rather than a queue-based system, Apache Kafka is log based, which means that as soon as a message arrives at the broker, it is retained in Kafka as per the retention policy. Uber’s payment processing system is also a great example of processing online payments using Apache Kafka.
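The difference between the two delivery models can be sketched as follows (message names are made up; a real Kafka consumer tracks per-partition offsets in much the same spirit):

```python
from collections import deque

# Queue model: reading + acknowledging a message removes it for everyone.
queue_store = deque(["m1", "m2", "m3"])
msg = queue_store.popleft()        # read + ack; "m1" is gone from the queue

# Log model: an append-only log retained per the retention policy;
# each consumer simply advances its own offset.
log_store = ["m1", "m2", "m3"]
offsets = {"consumer_a": 0, "consumer_b": 0}


def poll(consumer):
    # Return the next message for this consumer, or None if caught up.
    offset = offsets[consumer]
    if offset >= len(log_store):
        return None
    offsets[consumer] = offset + 1
    return log_store[offset]


print(poll("consumer_a"), poll("consumer_a"))  # m1 m2
print(poll("consumer_b"))                      # m1: the log still holds every message
```

Because the log keeps every message until retention expires, a second consumer (or a replay after a crash) can read from the beginning, which a delete-on-ack queue cannot offer.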
FaaS is also a good fit for the event-driven architectural pattern. The first myth to bust is that serverless means computing without servers: the servers are there, but they are provisioned on demand. In serverless processing, as soon as an event needs to be processed, a function can be spun up to do the processing, and it may be shut down after it is done.
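A rough sketch of the FaaS idea (the decorator, event types, and handlers below are invented for illustration, not any real cloud API): functions are registered against event types and are invoked only when a matching event arrives, rather than running as long-lived services.

```python
handlers = {}


def function(event_type):
    """Register a handler for an event type, loosely akin to wiring a
    trigger to a serverless function."""
    def wrap(fn):
        handlers[event_type] = fn
        return fn
    return wrap


@function("image.uploaded")
def make_thumbnail(event):
    return f"thumbnail for {event['key']}"


@function("order.placed")
def send_receipt(event):
    return f"receipt emailed for order {event['order_id']}"


def dispatch(event):
    # Compute "spins up" per event and is released when the handler returns.
    return handlers[event["type"]](event)


print(dispatch({"type": "image.uploaded", "key": "photo.png"}))
```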
With this research, it can be concluded that when a data platform or product is being planned, developed, or refined, the data technology to be used is determined by customer expectations (why?), data input (what?), data processing (how?), and data access (where?). Technology factors such as coupling, scalability, availability, fault tolerance, performance, reliability, and self-healing capabilities are also vital when creating great data platforms and products.
Enhance your knowledge
- What is the architectural style of Apache Airflow?
- What is the architectural style of Presto?
- Compare the architecture of Presto with Apache Hive.
- What is eventual consistency?
- Differentiate between failure and variable latency.
- What interview questions do you think would be most relevant for Data Engineering roles at Expedia Group™?