Optimizing a multi-speaker TTS model for faster CPU inference — Part 2
In this second part of the blog, I’ll explain in detail how we chose the previously mentioned ONNX settings for intra- and inter-operator parallelism.
Using the parallelism settings in a smart way:
This section contains most of the meat of this work. I’ll explain how, just by setting one ONNX Runtime parallelism option smartly, we can drastically reduce latency. Needless to say, this experiment turned out to be the most successful one, since it led to the lowest RTF and latency. I’ll jump straight to it: the experiment consisted of something as simple as adding this to the session options of the ONNX runtime:
sess_options.intra_op_num_threads = args.intra_op
Where
args.intra_op = 20
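For context, the full session setup would look roughly like this. It is a minimal sketch, not the project’s actual script: the model path is a placeholder and the argument parsing is simplified.

import argparse
import onnxruntime as ort

parser = argparse.ArgumentParser()
parser.add_argument("--intra_op", type=int, default=20)
args = parser.parse_args()

# Dedicate args.intra_op threads (20 in this experiment) to parallelising
# the work inside each operator.
sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = args.intra_op

# "model.onnx" is a placeholder path for the merged TTS model.
session = ort.InferenceSession("model.onnx", sess_options, providers=["CPUExecutionProvider"])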
Like the rest of the experiments, this one was run with 20 CPUs, and these are the results:
Total time (10 executions) (s): 165.2652 seconds
Average execution time per inference (s): 16.5265 seconds
Standard deviation of execution times (s): 0.6891 seconds
RTF: 0.09
Really impressive, right?
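In case you’re not familiar with the metric, the real-time factor (RTF) is the time spent synthesizing divided by the duration of the generated audio, so values below 1 mean faster-than-real-time synthesis. A tiny helper, assuming that usual definition:

def rtf(inference_seconds: float, audio_seconds: float) -> float:
    # Real-time factor: seconds of compute per second of generated audio.
    return inference_seconds / audio_seconds

# Hypothetical numbers: 2 s of compute for 20 s of audio gives an RTF of 0.1.
print(rtf(2.0, 20.0))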
Now, I’ll explain why we decided to set the number of intra-operator threads to match the number of CPUs available for the experiment. First of all, take a look at these results from our merged ONNX model, which show the average number of operators per layer:
Matxa
Total number of nodes: 1.30E+04
Total number of layers: 9.51E+03
Average number of nodes per layer: 1.37
AlvoCat
Total number of nodes: 2.72E+02
Total number of layers: 2.21E+02
Average number of nodes per layer: 1.23
You can check this yourself by running this script:
https://gist.github.com/mllopartbsc/eb73a889d3360c10f5dcc7e71241400e
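If you prefer not to open the gist, here is a minimal sketch of one way to obtain similar counts, assuming a “layer” is a topological level of the graph (the linked script may compute it differently):

import sys
from collections import defaultdict

import onnx

def nodes_per_layer(model_path: str) -> None:
    graph = onnx.load(model_path).graph

    # Tensors available before any node runs (graph inputs and initializers)
    # sit at level 0.
    produced_at = {t.name: 0 for t in list(graph.input) + list(graph.initializer)}

    layers = defaultdict(int)
    # graph.node is topologically sorted in valid ONNX models.
    for node in graph.node:
        # A node's layer is one deeper than the deepest layer among its inputs.
        level = 1 + max((produced_at.get(name, 0) for name in node.input), default=0)
        for out in node.output:
            produced_at[out] = level
        layers[level] += 1

    total_nodes = len(graph.node)
    total_layers = len(layers)
    print(f"Total number of nodes: {total_nodes}")
    print(f"Total number of layers: {total_layers}")
    print(f"Average number of nodes per layer: {total_nodes / total_layers:.2f}")

if __name__ == "__main__":
    nodes_per_layer(sys.argv[1])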
Before I continue with the explanation, let me introduce the concepts of intra- and inter-operator threads. In the context of neural networks (NNs), these terms refer to parallelism strategies used to optimize the execution of NN operations on multi-core processors. The goal, obviously, is to achieve better performance by efficiently using the available computational resources.
Intra-Operator Parallelism
Intra-operator parallelism (also known as operator-level parallelism) involves parallelizing the execution of a single operator across multiple threads. Each operator (e.g., matrix multiplication, convolution) can be broken down into smaller tasks that can be processed simultaneously. This is particularly useful for large operations where a single operator dominates the computational cost.
Inter-Operator Parallelism
Inter-operator parallelism (also known as graph-level parallelism) involves running different operators concurrently on different threads. This approach takes advantage of the computational graph structure of NNs, where some operations are independent and can be executed in parallel.
To sum it up:
- Intra-operator threads: run the tasks inside a single operator in parallel.
- Inter-operator threads: allow independent operators to run concurrently. (Both map directly onto ONNX Runtime session options, as sketched below.)
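In ONNX Runtime, both knobs are plain session options. A minimal sketch with illustrative (not recommended) thread counts:

import onnxruntime as ort

opts = ort.SessionOptions()
# Threads that split up the work inside a single operator.
opts.intra_op_num_threads = 8
# Threads that run independent operators of the graph concurrently.
opts.inter_op_num_threads = 2
# Inter-operator threads only take effect in parallel execution mode;
# the default ORT_SEQUENTIAL runs one operator at a time.
opts.execution_mode = ort.ExecutionMode.ORT_PARALLEL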
Now, with this information, take a look at Figure 1.
On the left side and in the middle of Figure 1, you can see two examples of the same neural network, presented as a graph. In the leftmost graph, the network is run synchronously, meaning that the computations of an operator only start when the computations of the previous operator have finished. The graph in the middle presents a different situation, where the network is run asynchronously, meaning that operators that are independent of each other are executed in parallel. Then, on the right side of Figure 1 you can see a pool of 32 CPUs. The grid at the bottom right of Figure 1 is a visualization of what happens when you set the number of inter-operator threads to 4: with a pool of 32 CPUs, each inter-operator thread can now use 8 CPUs.
The goal is to parallelise the execution of this network as much as possible, so that each CPU wastes no time on synchronization and spends most of its time executing the workload. With this goal in mind, we need to strike a balance between the number of intra- and inter-operator threads assigned to run the network. As you can see, the example network has an average of 1.6 operators per layer (8 operators / 5 layers). Given that, I’ll list some possible scenarios and whether or not they are well suited to the task:
1- To assign the number of intra-operator threads to be 32 and the number of inter-operator threads to be 1:
This option is wrong because the average number of operators per layer is 1.6. This means that, in most layers, the number of operators is closer to 2 than to 1. Hence, if we only create intra-operator threads, the network will run synchronously and we won’t be able to parallelise the execution of operators within the same layer. In short, there would be 32 CPUs executing a single operator when you could have had 16 CPUs on one operator and 16 CPUs on the other. ❌
2- To assign the number of intra-operator threads to be 1 and the number of inter-operator threads to be 32:
This option is wrong because you’d have just one CPU running one operator and another single CPU running the remaining operator from the same layer, while the other 30 CPUs sit idle. ❌
3- To assign the number of intra-operator threads to be 32 and the number of inter-operator threads to be 32:
In this case, even though you’d have enough threads for every possible scenario, creating more threads than available cores introduces a large synchronization overhead. This effect is called oversubscription, and it unfortunately leads to slower processing because the CPUs spend part of their time context-switching between threads instead of doing useful work. ❌
4- To assign the number of intra-operator threads to be 16 and the number of inter-operator threads to be 2:
This is the correct answer. Since most of the time you’ll be computing a layer that has close to 2 operators, it makes sense to use two inter-operator threads with 16 intra-operator threads each, for a total of 32 threads. That way, you avoid oversubscription while also offering the best configuration for the most common scenario. ✅ A toy calculation of this split is sketched below.
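Here is the arithmetic behind option 4, written out as a toy sketch (the 32-CPU pool and the 1.6 operators per layer come from the example above):

cpu_count = 32               # CPUs available in the example
avg_ops_per_layer = 8 / 5    # 1.6 operators per layer in the example graph

# Use roughly as many inter-operator threads as there are operators per layer,
# then split the CPU budget evenly among them for intra-operator work.
inter_op_threads = max(1, round(avg_ops_per_layer))   # -> 2
intra_op_threads = cpu_count // inter_op_threads      # -> 16
print(inter_op_threads, intra_op_threads)             # 2 16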
Now, let’s go back to our own case, which would look like Figure 2:
Since the average number of operators per layer for both models is below 1.5, it only makes sense to run the network sequentially, that is, without using any inter-operator threads. That way, all the computing resources can focus on the internal work within each layer’s nodes, which is precisely what intra-operator threads do, so we assign all CPUs to intra-operator threads. By doing that, we eliminate the synchronization overhead and achieve good results like the ones I’ve just shown. For more information about this subject, please check this paper:
Another interesting insight from this experiment was that adding more CPUs (and therefore more intra-operator threads) does not always reduce latency. Here are the results with 40 CPUs and 60 CPUs, with the number of intra-operator threads again set to match the number of CPUs:
40 CPUs:
Total time (10 executions) (s): 143.5563 seconds
Average execution time per inference (s): 14.3556 seconds
Standard deviation of execution times (s): 0.7709 seconds
RTF: 0.079
60 CPUs:
Total time (10 executions) (s): 166.9228 seconds
Average execution time per inference (s): 16.6923 seconds
Standard deviation of execution times (s): 0.2512 seconds
RTF: 0.092
As you can see, 40 CPUs does lead to an improvement in performance, which makes sense because there is more compute available for the internal operations within nodes. However, at 60 CPUs the model performs even worse than with 20 CPUs. Most likely, this happens due to the effects of hyperthreading:
Hyperthreading can cause overhead during inference because it introduces context switching between threads, which consumes CPU cycles and leads to inefficiencies; this is what we previously called synchronization overhead. Additionally, resource contention among threads can result in increased latency and suboptimal performance for inference tasks.
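One practical way to guard against this is to cap the thread count at the number of physical cores instead of logical (hyperthreaded) ones. A small sketch, assuming the psutil package is installed:

import psutil
import onnxruntime as ort

# os.cpu_count() reports logical cores; psutil can report physical ones.
physical_cores = psutil.cpu_count(logical=False) or 1

opts = ort.SessionOptions()
opts.intra_op_num_threads = physical_cores
opts.inter_op_num_threads = 1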
Additionally, I will present the experiment that led me to confirm all of the previously mentioned conclusions. For efficiency purposes, I ran this experiment with a shorter sentence, which reduced the inference times. The results are in a Google Sheets table, which you can access with this link:
In the table, you can see the inference times in seconds for different configurations of the ONNX session. The first column lists the inter-operator threads, and the first row lists the intra-operator threads, both numbered from 1 to 20. Taking into account a standard deviation of about 0.3 seconds, you can see that the best results are obtained with 20 intra-operator threads. Adding inter-operator threads on top of that has no effect. With these results, I formulated the reasoning presented above.
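For reference, the grid search behind that table can be reproduced with a loop along these lines. The model path and the input feed are placeholders, and the actual script may differ:

import time
import numpy as np
import onnxruntime as ort

def grid_search(model_path: str, feed: dict, n_runs: int = 10) -> dict:
    results = {}
    for inter_op in range(1, 21):
        for intra_op in range(1, 21):
            opts = ort.SessionOptions()
            opts.intra_op_num_threads = intra_op
            opts.inter_op_num_threads = inter_op
            # Inter-operator threads only matter in parallel execution mode.
            opts.execution_mode = ort.ExecutionMode.ORT_PARALLEL
            sess = ort.InferenceSession(model_path, opts, providers=["CPUExecutionProvider"])
            times = []
            for _ in range(n_runs):
                start = time.perf_counter()
                sess.run(None, feed)
                times.append(time.perf_counter() - start)
            results[(inter_op, intra_op)] = float(np.mean(times))
    return results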
When ONNX Runtime executes the network with its default configuration, it loses time switching between threads; this is the thread synchronization overhead mentioned above. Because of that, it performs worse than with 20 intra-operator threads and 1 inter-operator thread. Essentially, it would be similar to setting 32 intra-operator and 32 inter-operator threads.
Hopefully, this explanation was clear enough so that you could understand why we decided to choose these settings. Again, thank you for your attention! 😊
Here’s the link for part 3:
And here’s the link for part 1:
All the resources developed within the Aina project (projecteaina.cat) are available in the Aina Kit (projecteaina.cat/ainakit), an open-source repository where users can access and download the generated technologies.
This work/research has been promoted and financed by the Government of Catalonia through the Aina project.