Java Parallel Streams: The Speedup Myth and When It Actually Works

Rinor Hajrizi
6 min read · Apr 7, 2024

In this article we’re going to see how Parallel Streams in Java can improve the performance of many kinds of data processing. To show the power that Parallel Streams can bring to a Java application, I have chosen a concrete example of data processing: image processing. After presenting the example, we’ll make some adjustments to it and eventually change the type of data processing itself, and see for ourselves how this power diminishes in such cases. That will help you understand that, despite their great potential, Parallel Streams shouldn’t be relied on blindly. They are not a silver bullet for every performance problem involving a collection of data, and they might even perform worse if no proper analysis and testing is done beforehand.

What are Java Streams?

Since its introduction in Java 8, the Stream API has brought a new programming paradigm to Java: processing data in a declarative fashion. When you process data declaratively, you define what should be done to the values instead of how, which leads to more readable code. Beyond readability, Streams bring other benefits as well. Their immutability guarantees that processing never modifies the original (source) collection, which makes your code more predictable and easier to reason about and avoids unintended side effects. Another benefit is their laziness: intermediate operations are not executed until a terminal operation (like finding a maximum value) is called, which lets the pipeline be optimized as a whole. Last but not least, Streams have brought a new, convenient way of doing parallel processing, which is what this article is about.
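
To make the laziness point concrete, here is a minimal sketch (not from the original article) showing that an intermediate operation like filter only runs once a terminal operation is invoked:

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

static void lazinessDemo() {
    Stream<String> pipeline = List.of("a", "bb", "ccc").stream()
            .filter(s -> {
                System.out.println("filtering " + s); // side effect only to show when filtering runs
                return s.length() > 1;
            });

    // Nothing has been printed yet: building the pipeline does no work.
    List<String> longStrings = pipeline.collect(Collectors.toList()); // terminal op triggers the prints
}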

Creating a (Parallel) Stream

Streams can be parallelized to leverage multiple cores on your machine, potentially leading to significant performance gains for certain operations on datasets. A parallel stream is split into multiple sub-streams that are processed by multiple instances of the stream pipeline running on multiple threads, and their partial results are combined into the final result. While normal streams are created by invoking the Collection.stream() method, parallel streams are created by invoking the Collection.parallelStream() method. For example, the parallel stream below is created from an ArrayList of numbers. It keeps only the even numbers, doubles them, and collects the result into a new list:

List<Integer> doubledEvens = listOfNumbers.parallelStream()
        .filter(n -> n % 2 == 0) // keep even numbers only
        .map(n -> 2 * n)         // double each of them
        .collect(Collectors.toList());
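
It is also worth noting (this is not covered by the article’s examples) that any existing stream can be switched to parallel execution with the parallel() method, which is handy when the source isn’t a Collection:

import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Same pipeline, but run in parallel over a range of ints instead of a Collection.
List<Integer> doubled = IntStream.rangeClosed(1, 100)
        .parallel()
        .filter(n -> n % 2 == 0)
        .map(n -> 2 * n)
        .boxed()
        .collect(Collectors.toList());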

Why use Parallel Streams?

As mentioned above, parallel streams leverage multiple cores on your machine and can potentially speed up your data processing. Now I am going to demonstrate their potential with concrete examples. Imagine a service that does image processing: clients send it batches of images and tell it to process them in a certain way. In our example, the service increases the contrast of the images. Let’s say each request contains, on average, a batch of 10,000 images to edit. The method that accepts the images and returns them edited is shown below. Initially, we decided to write the method in a sequential fashion, using the classic Java for-loop:

public List<BufferedImage> batchIncreaseContrast(List<BufferedImage> images) {
    List<BufferedImage> editedImages = new ArrayList<>();
    for (int i = 0; i < images.size(); i++) {
        editedImages.add(ImageContrastService.increaseContrast(images.get(i)));
    }
    return editedImages;
}
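
The article doesn’t show ImageContrastService.increaseContrast itself. A minimal sketch of what such a method might look like, using java.awt.image.RescaleOp for a simple linear contrast adjustment (the scale and offset values are assumptions), could be:

import java.awt.image.BufferedImage;
import java.awt.image.RescaleOp;

public class ImageContrastService {

    // Hypothetical sketch: linearly rescales pixel values around mid-grey,
    // which is one simple way to increase contrast.
    public static BufferedImage increaseContrast(BufferedImage image) {
        float scale = 1.2f;               // > 1 spreads pixel values apart, increasing contrast
        float offset = 128 * (1 - scale); // keeps mid-grey roughly in place
        RescaleOp rescale = new RescaleOp(scale, offset, null);
        BufferedImage edited = new BufferedImage(
                image.getWidth(), image.getHeight(), BufferedImage.TYPE_INT_RGB);
        return rescale.filter(image, edited);
    }
}

The important property for the benchmark is that this per-image work is relatively expensive, touching every pixel of every image.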

I did some benchmarking by taking the image of a dog as an example, creating an ArrayList of 10,000 such images, and measuring the time it takes to complete the processing. After repeating the run 10 times, I ended up with an average of 58,936 milliseconds (58.9 seconds). In other words, each time a client sends an average batch of 10,000 images to process, it takes almost 1 minute to complete the job. Then, I rewrote the same method using Parallel Streams:

public List<BufferedImage> batchIncreaseContrast(List<BufferedImage> images) {
    return images.parallelStream()
            .map(ImageContrastService::increaseContrast)
            .collect(Collectors.toList());
}

Again, I did some benchmarking by running the code 10 times with the same dataset. I ended up with a mind-blowing average of 8,994 milliseconds. To put it another way, each time a client sends a batch of 10,000 images to process, the service completes the job in about 9 seconds, as opposed to almost 1 minute with the previous approach: around 6.5 times faster. That’s a staggering performance gain achieved simply by rewriting the method to use Parallel Streams.
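
The article doesn’t show the benchmark harness itself. A rough sketch of how the timings could be taken (the image file name, the ImageBatchService class name, and the single-shot timing are all assumptions; for rigorous numbers a harness like JMH would be preferable) looks like this:

import java.awt.image.BufferedImage;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.ImageIO;

public static void main(String[] args) throws Exception {
    // Hypothetical input: one dog photo, duplicated to form a batch of 10,000.
    BufferedImage dog = ImageIO.read(new File("dog.jpg"));
    List<BufferedImage> images = new ArrayList<>();
    for (int i = 0; i < 10_000; i++) {
        images.add(dog);
    }

    long start = System.nanoTime();
    List<BufferedImage> edited = new ImageBatchService().batchIncreaseContrast(images);
    long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
    System.out.println("Processed " + edited.size() + " images in " + elapsedMillis + " ms");
}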

Adjusting the data size

Next, I made some adjustments to the dataset size and the benchmarking results are pretty interesting. Specifically, I decreased the size of the dataset to 100 images first, and got the following average times to complete the job:

  • Classic Java for-loop: 659 milliseconds
  • Parallel Streams: 208 milliseconds

Decreasing the dataset by two orders of magnitude caused the performance gain to fall from 6.5x to about 3.2x.

Then I decreased the data size further, from 100 to 10, and got the following results:

  • Classic Java for-loop: 105 milliseconds
  • Parallel Streams: 57.5 milliseconds

The performance gain is now 1.8x, which is still pretty good when you think about it, but much lower than what we started with (6.5x)!

Adjusting the computational intensity

This time, we are going to continue where we left off with the data size, keeping it at 10, but we are changing the type of data processing itself. Think about a service that filters out prime numbers from a list of numbers. The usual batch size it receives is 10, so the benchmarking was done with a dataset of 10 random numbers (from 1 to 1,000). As before, I first wrote the method that handles this in a sequential fashion:

public List<Integer> filterOutPrimeNumbers(List<Integer> numbers) {
    List<Integer> nonPrimeNumbers = new ArrayList<>();
    for (int i = 0; i < numbers.size(); i++) {
        if (!IsPrimeUtil.isPrime(numbers.get(i))) {
            nonPrimeNumbers.add(numbers.get(i));
        }
    }
    return nonPrimeNumbers;
}
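
IsPrimeUtil isn’t shown in the article; a straightforward trial-division implementation (one reasonable assumption of what it might look like) is:

public class IsPrimeUtil {

    // Simple trial division up to the square root -- very cheap for numbers up to 1,000.
    public static boolean isPrime(int number) {
        if (number < 2) {
            return false;
        }
        for (int i = 2; (long) i * i <= number; i++) {
            if (number % i == 0) {
                return false;
            }
        }
        return true;
    }
}

The key point for the benchmark is that this check costs almost nothing compared to processing an image, which is exactly why the overhead of parallelization starts to matter.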

After running the code a couple of times, I ended up with an average of 0.15 milliseconds to complete the job.

Then, I rewrote the code using parallel streams:

public List<Integer> filterOutPrimeNumbers(List<Integer> numbers) {
    return numbers.parallelStream()
            .filter(number -> !IsPrimeUtil.isPrime(number))
            .collect(Collectors.toList());
}

And ended up with a surprising average of 3.5 milliseconds. While a difference of 3.35 milliseconds might not seem like much, imagine the service working at scale, receiving millions of requests. The version written with parallel streams would perform far worse than the one leveraging the good old single-threaded for-loop.

Earlier, we saw the performance gain diminishing as the data size decreased, and now, for the same small data size, we’re seeing the parallel stream perform worse than the for-loop. Why would the parallel stream’s performance vary this way, and how can data processed on multiple threads end up slower than the same data processed on a single thread?

Scratching the surface of Parallel Streams internals

In order to answer the questions above, we need to familiarize ourselves with how parallel streams work. Since the detailed mechanisms and internals of parallel streams are outside the scope of this article, we’ll simply scratch the surface (enough to get the point across). When a parallel stream executes, its source is split into chunks (sub-streams) that are processed as separate tasks on a shared pool of worker threads, and the partial results are then merged into the final result. Splitting the source, scheduling the tasks, coordinating the threads, and combining the results all take time. If the dataset is not large enough, that overhead plus the time spent processing the data in parallel can exceed the cost of simply iterating over the data directly. Similarly, if the per-element work is not computationally intensive, the overhead can outweigh the time it would take to just process the elements right away.
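
As a small illustration (not from the original article), you can inspect the common ForkJoinPool, which backs parallel streams by default, to see how much parallelism is actually available on your machine:

import java.util.concurrent.ForkJoinPool;

public static void main(String[] args) {
    // Parallel streams run their tasks on the common ForkJoinPool by default;
    // its parallelism is typically the number of cores minus one.
    System.out.println("Available processors: " + Runtime.getRuntime().availableProcessors());
    System.out.println("Common pool parallelism: " + ForkJoinPool.commonPool().getParallelism());

    // The pool size can be tuned with a JVM flag, e.g.
    // -Djava.util.concurrent.ForkJoinPool.common.parallelism=4
}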

Conclusion

When deciding whether to use (or avoid) parallel streams, there are other aspects to take into consideration besides data size and computational intensity. Some of them are the ease of splitting the source (ArrayLists, arrays, and HashMaps split well, while LinkedLists do not) and the non-interfering behavior of the stream’s lambdas, meaning they must not modify the underlying data source (see the example below). I couldn’t cover all of these aspects in the practical demonstrations, but they are just as important to keep in mind.
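
To illustrate the non-interference requirement, here is a hypothetical anti-example (not from the article): a predicate that mutates the stream’s own source can throw ConcurrentModificationException or produce arbitrary results, especially when run in parallel.

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public static void main(String[] args) {
    List<Integer> source = new ArrayList<>(List.of(1, 2, 3, 4, 5));

    // BAD: the predicate interferes with its own stream source while it is being traversed.
    // This can fail with ConcurrentModificationException or return unpredictable results.
    // The fix is simply not to touch `source` inside the lambda.
    List<Integer> evens = source.parallelStream()
            .filter(n -> {
                source.add(n * 10); // interference: mutating the source mid-stream
                return n % 2 == 0;
            })
            .collect(Collectors.toList());

    System.out.println(evens);
}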

In conclusion, parallel streams offer developers a unique opportunity to parallelize their data processing (and potentially gain performance) without cluttering their code with error-prone concurrent programming logic. However, it is important to understand that there is no guarantee that executing a stream in parallel will improve performance. In fact, it can do just the opposite for certain kinds of datasets and workloads.

Congratulations on making it to the end of the article!
