Optimizing a multi-speaker TTS model for faster CPU inference — Part 2
In this second part of the blog, I’ll explain in detail how we chose the previously mentioned ONNX settings for intra- and inter-operator parallelism.
Using the parallelism settings in a smart way:
This section contains most of the meat of this work. I’ll explain how, just by setting one ONNX Runtime parallelism option smartly, we can drastically reduce latency. Needless to say, this experiment turned out to be the most successful one, since it led to the lowest RTF and latency. I’ll jump straight to it: the experiment consisted of something as simple as adding this to the session options of the ONNX runtime:
sess_options.intra_op_num_threads = args.intra_op
Where
args.intra_op = 20
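For context, the full session setup would look roughly like this. It is a minimal sketch, not the project’s actual script: the model path is a placeholder and the argument parsing is simplified.

import argparse
import onnxruntime as ort

parser = argparse.ArgumentParser()
parser.add_argument("--intra_op", type=int, default=20)
args = parser.parse_args()

# Dedicate args.intra_op threads (20 in this experiment) to parallelising
# the work inside each operator.
sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = args.intra_op

# "model.onnx" is a placeholder path for the merged TTS model.
session = ort.InferenceSession("model.onnx", sess_options, providers=["CPUExecutionProvider"])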
Like the rest of the experiments, this one was run with 20 CPUs, and these are the results:
Total time (10 executions) (s): 165.2652 seconds
Average execution time per inference (s): 16.5265 seconds
Standard deviation of execution times (s): 0.6891 seconds
RTF: 0.09
Really impressive, right?
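In case you’re not familiar with the metric, the real-time factor (RTF) is the time spent synthesizing divided by the duration of the generated audio, so values below 1 mean faster-than-real-time synthesis. A tiny helper, assuming that usual definition:

def rtf(inference_seconds: float, audio_seconds: float) -> float:
    # Real-time factor: seconds of compute per second of generated audio.
    return inference_seconds / audio_seconds

# Hypothetical numbers: 2 s of compute for 20 s of audio gives an RTF of 0.1.
print(rtf(2.0, 20.0))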
Now, I’ll explain why we decided to set the number of intra-operator threads to match the number of CPUs available for the experiment. First of all, take a look at these results from our merged ONNX model, which show the average number of operators per layer:
Matxa
Total number of nodes: 1.30E+04
Total number of layers: 9.51E+03
Average number of nodes per layer: 1.37
AlvoCat
Total number of nodes: 2.72E+02
Total number of layers: 2.21E+02
Average number of nodes per layer: 1.23
You can check this yourself by running this script:
https://gist.github.com/mllopartbsc/eb73a889d3360c10f5dcc7e71241400e
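If you prefer not to open the gist, here is a minimal sketch of one way to obtain similar counts, assuming a “layer” is a topological level of the graph (the linked script may compute it differently):

import sys
from collections import defaultdict

import onnx

def nodes_per_layer(model_path: str) -> None:
    graph = onnx.load(model_path).graph

    # Tensors available before any node runs (graph inputs and initializers)
    # sit at level 0.
    produced_at = {t.name: 0 for t in list(graph.input) + list(graph.initializer)}

    layers = defaultdict(int)
    # graph.node is topologically sorted in valid ONNX models.
    for node in graph.node:
        # A node's layer is one deeper than the deepest layer among its inputs.
        level = 1 + max((produced_at.get(name, 0) for name in node.input), default=0)
        for out in node.output:
            produced_at[out] = level
        layers[level] += 1

    total_nodes = len(graph.node)
    total_layers = len(layers)
    print(f"Total number of nodes: {total_nodes}")
    print(f"Total number of layers: {total_layers}")
    print(f"Average number of nodes per layer: {total_nodes / total_layers:.2f}")

if __name__ == "__main__":
    nodes_per_layer(sys.argv[1])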
Before I continue with the explanation, let me introduce the concepts of intra- and inter-operator threads. In the context of neural networks (NNs), these terms refer to parallelism strategies used to optimize the execution of NN operations on multi-core processors. The goal, obviously, is to achieve better performance by efficiently using the available computational resources.
Intra-Operator Parallelism
Intra-operator parallelism (also known as operator-level parallelism) involves parallelizing the execution of a single operator across multiple threads. Each operator (e.g., matrix multiplication, convolution) can be broken down into smaller tasks that can be processed simultaneously. This is particularly useful for large operations where a single operator dominates the computational cost.
Inter-Operator Parallelism
Inter-operator parallelism (also known as graph-level parallelism) involves running different operators concurrently on different threads. This approach takes advantage of the computational graph structure of NNs, where some operations are independent and can be executed in parallel.
To sum it up:
- Intra-operator threads: run the tasks inside a single operator in parallel.
- Inter-operator threads: allow independent operators to run concurrently. (Both map directly onto ONNX Runtime session options, as sketched below.)
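In ONNX Runtime, both knobs are plain session options. A minimal sketch with illustrative (not recommended) thread counts:

import onnxruntime as ort

opts = ort.SessionOptions()
# Threads that split up the work inside a single operator.
opts.intra_op_num_threads = 8
# Threads that run independent operators of the graph concurrently.
opts.inter_op_num_threads = 2
# Inter-operator threads only take effect in parallel execution mode;
# the default ORT_SEQUENTIAL runs one operator at a time.
opts.execution_mode = ort.ExecutionMode.ORT_PARALLEL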
Now, with this information, take a look at Figure 1.
On the left side and in the middle of Figure 1, you can see two examples of the same neural network, presented as a graph. In the leftmost graph, the network is run synchronously, meaning that the computations of an operator only start when the computations of the previous operator have finished. The graph in the middle presents a different situation, where the network is run asynchronously, meaning that operators that are independent of each other are executed in parallel. Then, on the right side of Figure 1 you can see a pool of 32 CPUs. The grid at the bottom right of Figure 1 is a visualization of what happens when you set the number of inter-operator threads to 4: with a pool of 32 CPUs, each inter-operator thread can now use 8 CPUs.
The goal is to parallelise the execution of this network as much as possible, so that each CPU wastes no time on synchronization and spends most of its time executing the workload. With this goal in mind, we need to strike a balance between the number of intra- and inter-operator threads assigned to run the network. As you can see, the example network has an average of 1.6 operators per layer (8 operators / 5 layers). Given that, I’ll list some possible scenarios and whether or not they are well suited to the task:
1- To assign the number of intra-operator threads to be 32 and the number of inter-operator threads to be 1:
This option is wrong because the average number of operators per layer is 1.6. This means that, in most layers, the number of operators is closer to 2 than to 1. Hence, if we only create intra-operator threads, the network will run synchronously and we won’t be able to parallelise the execution of operators within the same layer. In short, there would be 32 CPUs executing a single operator when you could have had 16 CPUs on one operator and 16 CPUs on the other. ❌
2- To assign the number of intra-operator threads to be 1 and the number of inter-operator threads to be 32:
This option is wrong because you’d have just one CPU running one operator and another single CPU running the remaining operator from the same layer, while the other 30 CPUs sit idle. ❌
3- To assign the number of intra-operator threads to be 32 and the number of inter-operator threads to be 32:
In this case, even though you’d have enough threads for every possible scenario, creating more threads than available cores introduces a large synchronization overhead. This effect is called oversubscription, and it unfortunately leads to slower processing because the CPUs spend part of their time context-switching between threads instead of doing useful work. ❌
4- To assign the number of intra-operator threads to be 16 and the number of inter-operator threads to be 2:
This is the correct answer. Since most of the time you’ll be computing a layer that has close to 2 operators, it makes sense to use two inter-operator threads with 16 intra-operator threads each, for a total of 32 threads. That way, you avoid oversubscription while also offering the best configuration for the most common scenario. ✅ A toy calculation of this split is sketched below.
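Here is the arithmetic behind option 4, written out as a toy sketch (the 32-CPU pool and the 1.6 operators per layer come from the example above):

cpu_count = 32               # CPUs available in the example
avg_ops_per_layer = 8 / 5    # 1.6 operators per layer in the example graph

# Use roughly as many inter-operator threads as there are operators per layer,
# then split the CPU budget evenly among them for intra-operator work.
inter_op_threads = max(1, round(avg_ops_per_layer))   # -> 2
intra_op_threads = cpu_count // inter_op_threads      # -> 16
print(inter_op_threads, intra_op_threads)             # 2 16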
Now, let’s go back to our own case, which would look like Figure 2:
Since the average number of operators per layer for both models is below 1.5, it only makes sense to run the network sequentially, that is, without using any inter-operator threads. That way, all the computing resources can focus on the internal work within each layer’s nodes, which is precisely what intra-operator threads do, so we assign all CPUs to intra-operator threads. By doing that, we eliminate the synchronization overhead and achieve good results like the ones I’ve just shown. For more information about this subject, please check this paper:
Another interesting insight from this experiment was that adding more CPUs (and therefore more intra-operator threads) does not always reduce latency. Here are the results with 40 CPUs and 60 CPUs, with the number of intra-operator threads again set to match the number of CPUs:
40 CPUs:
Total time (10 executions) (s): 143.5563 seconds
Average execution time per inference (s): 14.3556 seconds
Standard deviation of execution times (s): 0.7709 seconds
RTF: 0.079
60 CPUs:
Total time (10 executions) (s): 166.9228 seconds
Average execution time per inference (s): 16.6923 seconds
Standard deviation of execution times (s): 0.2512 seconds
RTF: 0.092
As you can see, 40 CPUs does lead to an improvement in performance, which makes sense because there is more compute available for the internal operations within nodes. However, at 60 CPUs the model performs even worse than with 20 CPUs. Most likely, this happens due to the effects of hyperthreading:
Hyperthreading can cause overhead during inference because it introduces context switching between threads, which consumes CPU cycles and leads to inefficiencies; this is what we previously called synchronization overhead. Additionally, resource contention among threads can result in increased latency and suboptimal performance for inference tasks.
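One practical way to guard against this is to cap the thread count at the number of physical cores instead of logical (hyperthreaded) ones. A small sketch, assuming the psutil package is installed:

import psutil
import onnxruntime as ort

# os.cpu_count() reports logical cores; psutil can report physical ones.
physical_cores = psutil.cpu_count(logical=False) or 1

opts = ort.SessionOptions()
opts.intra_op_num_threads = physical_cores
opts.inter_op_num_threads = 1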
Additionally, I will present the experiment that led me to confirm all of the previously mentioned conclusions. For efficiency purposes, I ran this experiment with a shorter sentence, which reduced the inference times. The results are in a Google Sheets table, which you can access with this link:
In the table, you can see the inference times in seconds for different configurations of the ONNX session. The first column lists the inter-operator threads, and the first row lists the intra-operator threads, both numbered from 1 to 20. Taking into account a standard deviation of about 0.3 seconds, you can see that the best results are obtained with 20 intra-operator threads. Adding inter-operator threads on top of that has no effect. With these results, I formulated the reasoning presented above.
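For reference, the grid search behind that table can be reproduced with a loop along these lines. The model path and the input feed are placeholders, and the actual script may differ:

import time
import numpy as np
import onnxruntime as ort

def grid_search(model_path: str, feed: dict, n_runs: int = 10) -> dict:
    results = {}
    for inter_op in range(1, 21):
        for intra_op in range(1, 21):
            opts = ort.SessionOptions()
            opts.intra_op_num_threads = intra_op
            opts.inter_op_num_threads = inter_op
            # Inter-operator threads only matter in parallel execution mode.
            opts.execution_mode = ort.ExecutionMode.ORT_PARALLEL
            sess = ort.InferenceSession(model_path, opts, providers=["CPUExecutionProvider"])
            times = []
            for _ in range(n_runs):
                start = time.perf_counter()
                sess.run(None, feed)
                times.append(time.perf_counter() - start)
            results[(inter_op, intra_op)] = float(np.mean(times))
    return results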
When ONNX Runtime executes the network with its default configuration, it loses time switching between threads; this is the thread synchronization overhead mentioned above. Because of that, it performs worse than with 20 intra-operator threads and 1 inter-operator thread. Essentially, it would be similar to setting 32 intra-operator and 32 inter-operator threads.
Hopefully, this explanation was clear enough so that you could understand why we decided to choose these settings. Again, thank you for your attention! 😊
Here’s the link for part 3:
And here’s the link for part 1:
All the resources developed within the Aina project (projecteaina.cat) are available in the Aina Kit (projecteaina.cat/ainakit), an open-source repository where users can access and download the generated technologies.
This work/research has been promoted and financed by the Government of Catalonia through the Aina project.