MuleSoft — The Core Execution Engine

What is at the heart of Mule 4.x?

Arun Dutta
The Mule Blog
11 min read · Jun 4, 2020


Most of us may feel that this is a cliched question, as we all know that Mule runs on Java (Java 8 only). But Mule is about much more than its programming-language base. The core operating model has many intricacies, and over the years the Mule runtime has gone through numerous optimizations that enable it to deliver solid performance.

In this article I am going to focus on the core runtime model that Mule 4.x implements, at a high level, and on the new options that have become available for runtime tuning.

As I write this article, MuleSoft has just released version 4.3 a few weeks back, with plenty of fixes and new features added to it.

The Core Engine

Be it CloudHub or an on-premise, customer-hosted deployment, the Mule runtime is where your application code executes, and since its very inception the software has been built around event-driven architecture. Whatever flow you build, you use listeners (HTTP, VM, Scheduler) that act as trigger points for the events that kick-start your business flows.

With the release of Mule 4, the core engine introduced the reactive programming paradigm into the runtime (built on Reactive Streams implementations and the Java NIO libraries), thus combining the already established concurrency framework with event-based, asynchronous processing.

The detailed white paper on reactive programming implementation in Mule can be found here

What the reactive pattern brings is a melange of best practices from different patterns like Observer, Iterator and Proactor (which we will discuss a bit later). It is an asynchronous, non-blocking and declarative programming paradigm that enables working with streams of data rather than single, all-in-memory collections, as was the case with Java's old-school imperative approach.

Mule runtime has been designed and coded in such a way that the event processors in a flow can dictate what kind of threading profile a particular operation requires, and the runtime is intelligent enough to switch the thread context (or not) accordingly. Unlike Mule 3 applications, which depended on SEDA queue implementations and a dedicated thread pool per flow, where developers had to configure a flow processing strategy, Mule 4 needs no such configuration and automatically infers the execution type of each event processor at runtime, as sketched below.
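For contrast, here is a minimal, illustrative sketch of what this means in configuration terms (the flow names are mine, purely for illustration):

<!-- Mule 3: the developer explicitly picked a flow processing strategy -->
<flow name="ordersFlow" processingStrategy="synchronous">
    <logger level="INFO" message="runs entirely on the receiver thread"/>
</flow>

<!-- Mule 4: no processing-strategy attribute exists; the engine infers
     the execution type of each processor on its own -->
<flow name="ordersFlow">
    <logger level="INFO" message="runtime picks the appropriate pool"/>
</flow>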

In the next few sections let's take a deep dive into the advanced features that MuleSoft has put in place. We shall also look at how Mule 4.3 introduced some disruptive changes to the runtime engine.

Non-blocking IO

This is one of the biggest features inherited with the implementation of the reactive pattern: threads no longer block waiting for IO-intensive operations to complete, thanks to the Java NIO/NIO.2 libraries. This makes the Mule runtime comparable to platforms like Go and NodeJS, albeit Java, and consequently Mule, is still multi-threaded. In the next few paragraphs we will try to understand how non-blocking IO works in the Mule core.

We will see shortly that Mule 4 maintains three distinct thread pools. Whenever a flow is executed, threads are pulled from these pools and returned after use. There may be situations wherein a particular component in your flow makes an IO call to an external web service and waits for the response to come back. This is what is called a blocking operation.

Problem with blocking IO — In Mule 3 there were no such separate thread pools; each incoming request was allocated a thread from the flow's pool, or placed on a SEDA queue in the case of asynchronous processing. The number of active threads and other parameters were configurable at flow and connector level in Mule 3, but Mule 4 has removed all such options.

Mule 3 flow setup

Since a single thread handled the entire flow, if there was a blocking IO call, the thread had to wait until the response came back. The image below shows eight requests made concurrently to the system.

Request assigned to threads

If we assume that each blocking IO call takes 500 ms on average to execute and the pool has 5 threads available, then the system can only entertain 5 parallel requests within that 500 ms time frame. The picture below depicts all 5 requests stuck at the HTTP request connector stage, while the remaining 3 requests wait in a queue because there are no threads left to serve them.

All 5 requests get blocked on the HTTP Request call; incoming requests keep waiting
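Put differently, with 5 threads each tied up for 500 ms per call, the theoretical ceiling is 5 / 0.5 s = 10 requests per second, no matter how little CPU each request actually consumes; requests 6 to 8 simply sit in the queue until a thread frees up.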

Solution — Mule 4 circumvents this situation by providing a separate, reserved thread pool called the NIO Selector pool (courtesy of Java NIO). When a component makes such an IO call, the thread that was executing the flow is released immediately back to its own pool so that it can perform other operations, and the IO operation is delegated to the selector pool. (The selectors lean on the OS kernel, which takes care of the actual IO and its scheduling; the more base cores available, the better the processing.)

Take a look at the diagram below: just two selector threads are capable of handling multiple channels, on which the IO calls of each of the 5 incoming requests are executed in a multiplexed manner. Fewer threads are needed to handle the channels, which is a boon since switching between threads is expensive for the operating system; this in turn improves the overall throughput of the system.

IO calls delegated to selector channels; other requests are ready to be addressed by the main threads

The above picture is a very high-level depiction of just NIO; as we shall see later, Mule has different types of thread pools.

Once the job is complete, the selector intimates the main process via a callback that the data is available in the channel buffers; a thread is again allocated from the Mule thread pools and flow execution proceeds to the next component in the chain. In this way the overall throughput increases many-fold, as the system is able to cater to more and more traffic.

One of the NIO channels completes and notifies the selector pool, which then hands the request back to the main thread pool, where plenty of threads are available.

Thread Pools & Processing Models

So we now know how reactive programming helps Mule achieve non-blocking IO. In my previous thread-pool diagrams I simplified things by showing a single pool from which threads are spawned. Mule 4, in fact, has 3 distinct pools. These pools are managed by the core runtime engine and cannot be tuned the way they could be in Mule 3. Depending on the type of processor the Mule event is passing through, the engine switches context and uses threads from the respective pool.

  • CPU Light — Components that use this thread pool are the ones that perform very light operations and require minimal resource consumption, e.g. the Logger, filter and message-router components. This pool also performs two more important jobs — inter-processor event hand-off within the flow, and hand-off from the NIO Selector pools after IO calls complete.
  • CPU Intensive — This thread pool is used by components that execute time-consuming operations with a large CPU/memory footprint. DataWeave is one such component; scripting components like Groovy can also use this pool.
  • Blocking IO pool (this is different from the non-blocking NIO Selector pool) — This pool is used by components that perform blocking IO operations, like a database call or an event flow running within a local/XA transactional scope. These tasks make a thread wait until the operation completes while performing little CPU work, so they do not compete with the work of the other pools.

Any MuleSoft SDK developer can also specify the threading profile the connector they are developing will use, by explicitly declaring the BLOCKING, CPU_INTENSIVE or CPU_LITE execution type.

Mule 4 uses a scheduler execution service under the hood; the pools are auto-configured depending upon the memory and CPU resources available on the system (VM, bare-metal server) where the Mule runtime agent is running. Apart from the above 3 distinct thread pools there are a couple of other reserved pools — the NIO Selector pool, and recurrent-task pools created ad hoc by some connectors for completing repeatable tasks.

Selector pools are used extensively by HTTP connectors, as discussed previously (HTTP in Mule uses the Grizzly library internally). The HTTP Request component has a dedicated selector pool for every application running on the same Mule runtime, whereas the HTTP Listener component shares one selector pool across all applications running on that runtime.

The above scenario applies to on-premise (clustered) deployments; on CloudHub it is always one application per runtime, on a single VM/worker.

Proactor pattern — Mule follows this popular design pattern to switch thread context and perform concurrent execution. It segregates all the tasks to be executed in the flow into categories, then assigns the required thread pool to each category. Let us take an example flow; a representative sketch follows.
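A minimal sketch of such a flow (configuration names like HTTP_Listener_config are illustrative):

<flow name="proactorDemoFlow">
    <!-- 1. Event source: handled on the shared NIO Selector pool -->
    <http:listener config-ref="HTTP_Listener_config" path="/orders"/>
    <!-- 2. Lightweight work: CPU_LITE -->
    <logger level="INFO" message="Request received"/>
    <!-- 3. Outbound HTTP call: non-blocking, dedicated NIO Selector pool -->
    <http:request method="GET" config-ref="HTTP_Request_config" path="/inventory"/>
    <!-- 4. DataWeave transformation: CPU_INTENSIVE -->
    <ee:transform>
        <ee:message>
            <ee:set-payload><![CDATA[%dw 2.0
output application/json
---
payload]]></ee:set-payload>
        </ee:message>
    </ee:transform>
    <!-- 5. Database call: Blocking IO pool -->
    <db:insert config-ref="Database_Config">
        <db:sql>INSERT INTO orders (id) VALUES (:id)</db:sql>
        <db:input-parameters><![CDATA[#[{id: payload.id}]]]></db:input-parameters>
    </db:insert>
    <!-- 6. Final logger: CPU_LITE (or the runtime may reuse the IO thread) -->
    <logger level="INFO" message="Order stored"/>
</flow>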

Step by step, if we assess the components, these are the thread pools assigned and the context switches that take place:

  1. The HTTP Listener, as discussed, is allocated a thread from the shared NIO Selector pool.
  2. The flow then switches to the CPU_LITE thread pool for the Logger component.
  3. The HTTP Request connector follows next; it dispatches a non-blocking IO call through its dedicated NIO Selector pool, and in the meantime the CPU_LITE thread is released back to its pool. Once the response comes back, a new CPU_LITE thread is allocated to pick the event back up (this process was discussed at a high level in the Non-blocking IO section).
  4. The next processor in the chain is a DataWeave transformation, which needs a switch to the CPU_INTENSIVE thread pool.
  5. Once the transformation is done, a context switch to the Blocking IO pool is made, as the database call requires the thread to wait until the response comes back.
  6. The last processor in the chain is a Logger component, which would usually require a switch to the CPU_LITE thread pool, but the runtime may decide to stick with the Blocking IO thread to execute the logger and save the cost of a thread context switch.

All inter-processor hand-offs are performed internally by the CPU_LITE thread pool, as discussed before.

Thread pool assignment — Here is a rough estimate of how many threads are assigned to each pool depending on the resources available. The pool configurations depend on the number of CPU cores (vCores on CloudHub). If we consider a CloudHub worker with 4 vCores and 2 GB of memory, the thread pool sizes will be as follows:

Worker thread pool config for a 4 vCore CloudHub worker (*mem — RAM memory)
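For reference, the documented sizing formulas behind such a table (an assumption based on the default scheduler-pools.conf of Mule 4.1/4.2; mem is usable memory in KB, so treat these values as indicative) are roughly:

cpuLight.threadPool.size     = 2 * cores
cpuIntensive.threadPool.size = 2 * cores
io.threadPool.coreSize       = cores
io.threadPool.maxSize        = max(2, cores + ((mem - 245760) / 5120))

So a 4 vCore worker gets 8 CPU_LITE and 8 CPU_INTENSIVE threads, while the Blocking IO pool can grow much larger depending on available memory.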

Developers have the option to tune and override this config in on-premise deployments by modifying the scheduler-pools.conf file. CloudHub users do not have access to this file, as the runtime is managed entirely by MuleSoft. There is, however, an option to override the configuration at application level by adding the following sample snippet to the application XML file.

<ee:scheduler-pools gracefulShutdownTimeout="15000">
    <ee:cpu-light
        poolSize="8"
        queueSize="1024"/>
    <ee:io
        corePoolSize="6"
        maxPoolSize="12"
        queueSize="0"
        keepAlive="30000"/>
    <ee:cpu-intensive
        poolSize="4"
        queueSize="2048"/>
</ee:scheduler-pools>

Applying pool configurations at the application level causes the runtime to create a completely new set of thread pools for the deployed application, but this configuration does not change the default settings configured in the scheduler-pools.conf file.

As per the Mule documentation, if you define pool configurations at the application level for Mule apps deployed to CloudHub, you should be mindful of worker sizes, because fractional vCores have less memory.

Backpressure management

This is another cool feature that comes with Mule 4, owing to its reactive architecture. It is a way for a consumer (subscriber) to notify a producer to slow down the rate of the event stream being generated, because the consumer does not have the threads/resources to process it.

Backpressure management is a very important feature in Mule and is automatically handled by the runtime, which allows it to auto-tune while processing massive amounts of data. Manual management is also possible via the maxConcurrency attribute: developers can configure it so that whenever the threshold of concurrently processed events is reached, the flow stops accepting new events and errors start being logged.

The flow itself also has a maxConcurrency attribute (shown below), but the configuration may be ignored by the runtime under certain situations.
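A minimal sketch (flow and config names are illustrative):

<flow name="ordersFlow" maxConcurrency="100">
    <!-- at most 100 events are processed concurrently; beyond that,
         the source is pushed back on and events may be rejected -->
    <http:listener config-ref="HTTP_Listener_config" path="/orders"/>
    <logger level="INFO" message="processing #[correlationId]"/>
</flow>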

What's the Uber Thread Pool?

As I mentioned before, MuleSoft recently released 4.3, and with it came a disruptive change in its architecture: the introduction of the Uber pool. We saw that in Mule 4.1/4.2 there were 3 distinct thread pools; these are now merged into one single pool that serves all the different categories — CPU_LITE, CPU_INTENSIVE and BLOCKING_IO. The Uber pool is auto-managed by the Mule runtime and shared across all the apps running on that runtime.

There is one major change to transaction management: in version 4.3, processors inside a transactional scope do not undergo any thread context switch and are executed on a single thread only.

As per the documentation, the single thread pool allows Mule to be efficient, requiring significantly fewer threads (and their inherent memory footprint) to process a given workload when compared to Mule 3.

In terms of compatibility, the Uber pool strategy is backward compatible, as it has no impact on an application's behavior. In the event of unforeseen corner cases, or if fine-tuning customizations have been made, the Mule runtime engine can always be configured to go back to the previous threading model.

By default, Mule 4.3 is configured to run in UBER pool mode, but it can be switched to the old DEDICATED mode (3 separate thread pools) by updating the scheduler-pools.conf file.

A sample Uber pool configuration in the scheduler-pools.conf file looks like this:
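(The entries below are representative of the documented defaults; verify them against the file shipped with your runtime.)

org.mule.runtime.scheduler.SchedulerPoolStrategy=UBER
org.mule.runtime.scheduler.gracefulShutdownTimeout=15000
org.mule.runtime.scheduler.uber.threadPool.coreSize=cores
org.mule.runtime.scheduler.uber.threadPool.maxSize=max(2, cores + ((mem - 245760) / 5120))
org.mule.runtime.scheduler.uber.workQueue.size=0
org.mule.runtime.scheduler.uber.threadPool.threadKeepAlive=30000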

Uber Pool default configuration

A sample DEDICATED pool configuration in the scheduler-pools.conf file looks like this:
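(Again, representative entries based on the documented defaults:)

org.mule.runtime.scheduler.SchedulerPoolStrategy=DEDICATED
org.mule.runtime.scheduler.gracefulShutdownTimeout=15000
org.mule.runtime.scheduler.cpuLight.threadPool.size=2*cores
org.mule.runtime.scheduler.cpuLight.workQueue.size=0
org.mule.runtime.scheduler.io.threadPool.coreSize=cores
org.mule.runtime.scheduler.io.threadPool.maxSize=max(2, cores + ((mem - 245760) / 5120))
org.mule.runtime.scheduler.io.workQueue.size=0
org.mule.runtime.scheduler.io.threadPool.threadKeepAlive=30000
org.mule.runtime.scheduler.cpuIntensive.threadPool.size=2*cores
org.mule.runtime.scheduler.cpuIntensive.workQueue.size=2*cores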

Dedicated (separate) pool default configuration

For application-level configuration (required in case you need to tune a CloudHub-based worker), the following is a sample fragment, which spawns a new, application-specific thread pool. This does not override the pool described in the scheduler-pools.conf file.

<ee:scheduler-pools poolStrategy="UBER" gracefulShutdownTimeout="15000">
    <ee:uber
        corePoolSize="1"
        maxPoolSize="9"
        queueSize="5"
        keepAlive="5"/>
</ee:scheduler-pools>

Based on the load and stress testing they have done, MuleSoft recommends not using any custom threading profile and keeping the UBER pooling strategy on, as it has a very low memory footprint and hence delivers better throughput and performance.

That's all for now; I will be back with more core fundamentals in the future.
