Performance of the pipeline architecture: how does the number of workers affect performance?

Mar 26, 2019 · 8 min read


Nihla Akram and Malith Jayasinghe

Introduction

With the advancement of technology, the rate at which data is produced has increased significantly. In many application domains, such data must be processed in real time rather than with a store-and-process approach. Many real-time applications adopt the pipeline architecture to process data in a streaming fashion. The pipeline architecture is a parallelization method that allows a program to run in a decomposed manner: it consists of multiple stages, where each stage consists of a queue and a worker. Each stage of the pipeline takes the output of the previous stage as its input, processes it, and passes its output to the next stage. In this article, we investigate the impact of the number of stages on performance. We show that the number of stages that yields the best performance depends on the workload characteristics.

Background

The pipeline architecture is a commonly used architecture when implementing applications in multithreaded environments. We can consider it as a collection of connected components (or stages) where each stage consists of a queue (buffer) and a worker.

Figure 1 illustrates the pipeline architecture.

Figure 1: Pipeline architecture

Let m be the number of stages in the pipeline, and let Si denote stage i. Let Qi and Wi be the queue and the worker of stage i (i.e. Si), respectively.

A new task (request) first arrives at Q1 and it will wait in Q1 in a First-Come-First-Served (FCFS) manner until W1 processes it. The output of W1 is placed in Q2 where it will wait in Q2 until W2 processes it. This process continues until Wm processes the task at which point the task departs the system.
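To make this flow concrete, here is a minimal Java sketch of such a pipeline (our own illustration, not the implementation used in the experiments): one BlockingQueue per stage, one worker thread per stage, and an extra queue to collect the output of the last stage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal m-stage pipeline: stage i owns queue Qi and worker Wi. A task waits
// in Qi (FCFS) until Wi takes it, processes it, and puts the result on Qi+1.
public class PipelineSketch {
    public static void main(String[] args) throws InterruptedException {
        int m = 3; // number of stages
        List<BlockingQueue<StringBuilder>> queues = new ArrayList<>();
        for (int i = 0; i <= m; i++) {
            queues.add(new LinkedBlockingQueue<>()); // Q1..Qm plus an output queue
        }
        for (int i = 0; i < m; i++) {
            BlockingQueue<StringBuilder> in = queues.get(i);
            BlockingQueue<StringBuilder> out = queues.get(i + 1);
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        StringBuilder task = in.take(); // FCFS wait on Qi
                        task.append('x');               // placeholder for the stage's work
                        out.put(task);                  // hand the task to the next stage
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // stop on shutdown
                }
            });
            worker.setDaemon(true); // let the JVM exit when main finishes
            worker.start();
        }
        queues.get(0).put(new StringBuilder());   // a new task arrives at Q1 ...
        System.out.println(queues.get(m).take()); // ... and departs after Wm ("xxx")
    }
}
```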

One key advantage of the pipeline architecture is that its connected stages allow the workers to operate on different tasks in parallel. This can result in an increase in throughput. As a result, the pipeline architecture is used extensively in many systems. For example, stream processing platforms such as WSO2 SP, which is based on WSO2 Siddhi, use the pipeline architecture to achieve high throughput.

There are several use cases one can implement using this pipelining model. One example is sentiment analysis, where an application requires multiple stages such as data preprocessing, sentiment classification, and sentiment summarization. The pipeline architecture is also used extensively in image processing, 3D rendering, big data analytics, and document classification.

Experiment Details

This section provides details of how we conduct our experiments. The workloads we consider in this article are CPU-bound workloads. Our initial objective is to study how the number of stages in the pipeline impacts the performance under different scenarios. We use the notation n-stage-pipeline to refer to a pipeline architecture with n stages. To understand this behaviour, we carry out a series of experiments in which we vary the following parameters:

  1. The number of stages (stage = worker + queue)
  2. The service time/processing time
  3. The arrival rate (into the system)

We conducted the experiments on a machine with a 2.00 GHz quad-core Intel Core i7 CPU and 8 GB of RAM. We use two performance metrics to evaluate the performance: throughput and (average) latency. We define throughput as the rate at which the system processes tasks, and latency as the difference between the time at which a task leaves the system and the time at which it arrives. To compute the throughput and average latency, we run each scenario 5 times and take the average.
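For concreteness, the following is a small sketch (ours, not the actual benchmark code) of how these two metrics can be computed from per-task arrival and departure timestamps:

```java
// Sketch: computing throughput and average latency from per-task timestamps.
// arrivalNanos[i] is taken when task i enters Q1; departureNanos[i] when it
// leaves the last worker.
public class Metrics {
    static void report(long[] arrivalNanos, long[] departureNanos) {
        int n = arrivalNanos.length;
        long totalLatencyNanos = 0;
        for (int i = 0; i < n; i++) {
            totalLatencyNanos += departureNanos[i] - arrivalNanos[i]; // departure minus arrival
        }
        double avgLatencyMs = totalLatencyNanos / (double) n / 1_000_000.0;
        // Throughput: completed tasks per second over the measurement window.
        double windowSeconds = (departureNanos[n - 1] - arrivalNanos[0]) / 1_000_000_000.0;
        System.out.printf("throughput = %.1f tasks/s, avg latency = %.3f ms%n",
                n / windowSeconds, avgLatencyMs);
    }

    public static void main(String[] args) {
        // Two example tasks with made-up timestamps (in nanoseconds).
        report(new long[] {0, 1_000_000}, new long[] {4_000_000, 6_000_000});
    }
}
```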

We implement a scenario using the pipeline architecture in which the arrival of a new request (task) into the system leads the workers in the pipeline to construct a message of a specific size. We consider messages of sizes 10 bytes, 1 KB, 10 KB, 100 KB, and 100 MB.

Let us now explain how the pipeline constructs a message, using the 10-byte message as an example. Assume the pipeline has one stage (i.e. a 1-stage-pipeline). A request arrives at Q1 and waits there until W1 processes it. Here the term “process” refers to W1 constructing a message of size 10 bytes. When the pipeline has 2 stages, W1 constructs the first half of the message (size = 5 bytes) and places the partially constructed message in Q2; W2 then reads the message from Q2 and constructs the second half. In general, when there are m stages in the pipeline, each worker builds a message of size 10 bytes/m. We note that the processing time of a worker is proportional to the size of the message it constructs.
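In code, a worker's "process" step could look like the following hypothetical helper (which would replace the placeholder append in the earlier pipeline sketch):

```java
// Hypothetical per-stage work: with m stages and a target message of
// totalSize bytes, each worker appends its share of totalSize/m bytes
// to the partially built message before passing it downstream.
static StringBuilder buildPart(StringBuilder partial, int totalSize, int m) {
    int share = totalSize / m;   // bytes this stage contributes
    for (int i = 0; i < share; i++) {
        partial.append('a');     // one ASCII character = one byte of payload
    }
    return partial;
}
```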

Taking this into consideration, we classify the processing times of tasks into 6 classes based on the message size. When we measure the processing time, we use a single stage and take the difference between the time at which the request (task) leaves the worker and the time at which the worker starts processing it (note: we do not include queuing time when measuring the processing time, as waiting in the queue is not part of processing). Because we use different message sizes, we obtain a wide range of processing times. For example, class 1 represents extremely small processing times, while class 6 represents high processing times.
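Measuring the processing time as described might look like this sketch, added to the same illustrative class as before; it reuses the hypothetical buildPart helper and deliberately starts the clock only after the task has been taken from the queue:

```java
// Sketch: measure pure processing time for one task in a 1-stage pipeline,
// excluding the time the task spent waiting in the queue.
static long timeOneTask(BlockingQueue<StringBuilder> q1) throws InterruptedException {
    StringBuilder task = q1.take();   // queuing delay ends here
    long start = System.nanoTime();   // worker starts processing
    buildPart(task, 10, 1);           // construct the whole 10-byte message
    return System.nanoTime() - start; // processing time only
}
```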

The impact of processing time on the performance

Let’s first discuss the impact of the number of stages in the pipeline on the throughput and average latency (under a fixed arrival rate of 1000 requests/second). The following figures show how the throughput and average latency vary with the number of stages.

We clearly see a degradation in the throughput as the processing times of tasks increase. Similarly, we see an increase in the average latency as the processing times of tasks increase. We expect this behaviour: as the processing time increases, the end-to-end latency grows and the number of requests the system can process per unit time drops.

Let us now take a look at the impact of the number of stages under different workload classes. The following table summarizes the key observations.

Let us now try to reason about the behaviour we noticed above. It is important to understand that there are certain overheads in processing requests in a pipelined fashion. For example, when we have multiple stages in the pipeline, there is context-switch overhead because we process tasks using multiple threads. The context-switch overhead has a direct impact on the performance, in particular on the latency. In addition, there is a cost associated with transferring information from one stage to the next: passing a task between two consecutive stages can incur additional processing (e.g. to create a transfer object), which impacts the performance. Moreover, there is contention due to the use of shared data structures such as queues, which also impacts the performance.

When it comes to tasks requiring small processing times (e.g. class 1 and class 2), the overall overhead is significant compared to the processing time of the tasks. Therefore, there is no advantage in having more than one stage in the pipeline for such workloads; in fact, there can be a performance degradation, as we see in the plots above. As the processing times of tasks increase (e.g. class 4, class 5 and class 6), we can achieve performance improvements by using more than one stage in the pipeline. For example, we note that for the high-processing-time scenarios, the 5-stage-pipeline resulted in the highest throughput and the best average latency. Therefore, for high-processing-time use cases, there is clearly a benefit to having more than one stage, as it allows the pipeline to improve performance by making use of the available resources (i.e. CPU cores).

The impact of arrival rate on the performance

In the previous section, we presented the results under a fixed arrival rate of 1000 requests/second. This section discusses how the arrival rate into the pipeline impacts the performance. Here we notice that the arrival rate also has an impact on the optimal number of stages (i.e. the number of stages with the best performance). The following figures show how the throughput and average latency vary under different arrival rates for class 1 and class 5.

We note from the plots above that as the arrival rate increases, the throughput increases and the average latency increases due to the increased queuing delay. Let us first consider the impact of the arrival rate on the class 1 workload (which represents very small processing times). We note that the pipeline with 1 stage resulted in the best performance. As pointed out earlier, for tasks requiring small processing times (e.g. see the results above for class 1) we get no improvement from using more than one stage in the pipeline, and here we note that this holds for all arrival rates tested. In the case of the class 5 workload, the behaviour is different: the number of stages that yields the best performance varies with the arrival rate.

Conclusion

In this article, we investigated the impact of the number of stages on the performance of the pipeline model. We showed that the number of stages that yields the best performance depends on the workload characteristics. The following are the key takeaways:

  • The number of stages that results in the best performance in the pipeline architecture depends on the workload properties (in particular, processing time and arrival rate).
  • If the processing times of tasks are relatively small, then better performance can be achieved with a small number of stages (or simply one stage).
  • Using an arbitrary number of stages in the pipeline can result in poor performance.
  • Dynamically adjusting the number of stages in the pipeline architecture can result in better performance under varying (non-stationary) traffic conditions.


Malith Jayasinghe

Software Architect, Programmer, Computer Scientist, Researcher, Senior Director (Platform Architecture) at WSO2