Optimizing a multi-speaker TTS model for faster CPU inference — Part 3
In this final part of our blog, I’ll explain in a bit more detail some other optimization techniques we tried, even though they ultimately proved unsuccessful.
Trying DeepSparse:
Originally, the idea was to use DeepSparse to optimize the models for inference. However, after some successful attempts with static inputs, where the engine was compiled and run with inputs of the same size, I realized that the engine doesn’t accept dynamic inputs. It’s a pity, because it was running as well as some of the best parallel executions with ONNX. Because of that limitation, DeepSparse was discarded as an option for optimization, since it would have been slower than the other options. Nonetheless, it could become viable later on, since Neural Magic has stated that they plan to support dynamic inputs in the future.
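For reference, this is roughly the shape of the static-input experiment that did work. It’s a minimal sketch rather than the exact script we used; the input dtypes, the sequence length of 100 and the speaker id are assumptions, not values from the original tests:

# Minimal sketch: compiling the merged ONNX model with DeepSparse and running it
# with static (fixed-size) inputs. Input names match the merged model's inputs;
# the dtypes, the sequence length of 100 and the speaker id are assumptions.
import numpy as np
from deepsparse import compile_model

engine = compile_model("matxa_vocos_merged_HF.onnx", batch_size=1)

inputs = [
    np.random.randint(0, 100, size=(1, 100), dtype=np.int64),  # model1_x (token ids)
    np.array([100], dtype=np.int64),                            # model1_x_lengths
    np.array([0.667, 1.0], dtype=np.float32),                   # model1_scales
    np.array([2], dtype=np.int64),                              # model1_spks (speaker id)
]

# Works as long as every call keeps these exact shapes; passing a different
# sequence length fails because the engine does not accept dynamic inputs.
outputs = engine.run(inputs)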
Does the simplifier work?
After the success of the previous experiment, I tried the onnx-simplifier library to see whether reducing the model’s size and redundancies would yield better inference results. Here’s the command I used to simplify the model for dynamic inputs:
onnxsim matxa_vocos_merged_HF.onnx matxa_vocos_merged_HF_simplified_dynamic.onnx --overwrite-input-shape model1_x:1,-1 model1_x_lengths:1 model1_scales:2 model1_spks:1
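The same simplification can also be run from Python with the onnxsim API. Here’s an illustrative sketch: the overwrite_input_shapes keyword mirrors the --overwrite-input-shape flag in recent onnxsim releases, though older versions may name the argument differently:

# Python equivalent of the onnxsim command above. The overwrite_input_shapes
# keyword mirrors --overwrite-input-shape; older onnxsim releases may use a
# different argument name for the same thing.
import onnx
from onnxsim import simplify

model = onnx.load("matxa_vocos_merged_HF.onnx")
model_simplified, check = simplify(
    model,
    overwrite_input_shapes={
        "model1_x": [1, -1],
        "model1_x_lengths": [1],
        "model1_scales": [2],
        "model1_spks": [1],
    },
)
assert check, "simplified model failed the validation check"
onnx.save(model_simplified, "matxa_vocos_merged_HF_simplified_dynamic.onnx")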
And here’s the results of the simplification:
https://gist.github.com/mllopartbsc/b03788c1c9c4365455607dcd15796ee1
However, inference times increased:
Total time (10 executions) (s): 220.0996 seconds
Average execution time per inference (s): 22.0100 seconds
Standard deviation of execution times (s): 3.3705 seconds
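For context, these numbers come from a simple timing harness: run the model ten times, record each wall-clock duration, and report the total, the mean and the standard deviation. Below is a minimal sketch of that kind of harness with ONNX Runtime; the dummy inputs (dtypes, sequence length, speaker id) are placeholders rather than the real evaluation data:

# Minimal timing harness: 10 executions, reporting total, mean and std.
# The dummy inputs are placeholders, not the real test sentences.
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "matxa_vocos_merged_HF_simplified_dynamic.onnx",
    providers=["CPUExecutionProvider"],
)

inputs = {
    "model1_x": np.random.randint(0, 100, size=(1, 100), dtype=np.int64),
    "model1_x_lengths": np.array([100], dtype=np.int64),
    "model1_scales": np.array([0.667, 1.0], dtype=np.float32),
    "model1_spks": np.array([2], dtype=np.int64),
}

times = []
for _ in range(10):
    start = time.perf_counter()
    session.run(None, inputs)
    times.append(time.perf_counter() - start)

print(f"Total time (10 executions) (s): {sum(times):.4f}")
print(f"Average execution time per inference (s): {np.mean(times):.4f}")
print(f"Standard deviation of execution times (s): {np.std(times):.4f}")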
This slowdown might be due to a number of reasons:
- Model Structure Changes: The simplification process might alter the structure of the model in ways that are not optimal for the inference engine. Simplified models can sometimes include operations or sequences of operations that are less efficient for certain hardware or software environments.
- Graph Optimization Limitations: The simplifier aims to reduce redundancy and simplify operations, but these changes might interfere with optimizations that the inference engine could perform. For example, the engine might have specific optimizations for certain complex operations that are lost when those operations are simplified.
- Memory Access Patterns: Simplification can alter memory access patterns. If the simplification changes how data is accessed and stored, it might result in less efficient use of caches and memory bandwidth.
- Operation Fusion and Parallelism: Some inference engines perform operation fusion (combining multiple operations into a single, more efficient operation) and parallel execution. Simplifying the model might break these fused operations or disrupt parallel execution, leading to longer inference times.
However, if the model’s main limitation is memory usage, the simplifier library can be a good option for reducing model size.
Does pruning work?
As my last attempt, I tried model pruning, which resulted in another unsuccessful optimization. Here’s the script I used for pruning:
https://gist.github.com/mllopartbsc/424216addc20c9495a7c4c310fc06540
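The gist holds the full script; as a generic illustration (not necessarily identical to what that script does), magnitude pruning in PyTorch typically looks like the sketch below, where the checkpoint path and the 20% pruning ratio are placeholders:

# Generic L1 (magnitude) pruning sketch with torch.nn.utils.prune.
# Illustrative only: the checkpoint path is hypothetical and assumes the
# file stores a full nn.Module, and the 20% ratio is a placeholder.
import torch
import torch.nn.utils.prune as prune

model = torch.load("matxa_checkpoint.pt", map_location="cpu")

# Collect the weight tensors of the linear and convolutional layers.
parameters_to_prune = [
    (module, "weight")
    for module in model.modules()
    if isinstance(module, (torch.nn.Linear, torch.nn.Conv1d))
]

# Zero out the 20% of weights with the smallest magnitude, globally.
prune.global_unstructured(
    parameters_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.2,
)

# Make the pruning permanent (drop the masks, keep the zeroed weights).
for module, name in parameters_to_prune:
    prune.remove(module, name)

torch.save(model, "matxa_checkpoint_pruned.pt")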
The reduction in model size was only about 5 KB. However, these are the inference results:
Total time (10 executions) (s): 179.1162 seconds
Average execution time per inference (s): 17.9116 seconds
Standard deviation of execution times (s): 2.3069 seconds
They are slightly worse than those of the non-pruned version. Here are some of the reasons why these results are so similar to those obtained with the simplifier library:
- Irregular Sparsity: Pruning often results in sparse matrices. If the sparsity pattern is irregular, many inference engines may not be optimized to handle such sparsity efficiently, leading to increased inference times.
- Lack of Hardware Support: Many current hardware accelerators and inference engines are optimized for dense operations. Sparse operations, especially those with irregular sparsity, might not benefit from these optimizations and can even be slower due to the overhead of handling sparse data structures.
- Cache and Memory Access Patterns: In the same way as the simplified model, pruned models can have different memory access patterns compared to their dense counterparts. These new patterns can lead to suboptimal cache usage and increased memory access latency.
- Operation Overhead: Handling sparsity often involves additional overhead to skip zero values, which can outweigh the computational savings, particularly if the percentage of pruned weights is not high enough (see the sketch after this list for a quick way to check that percentage).
- Fragmented Computations: Pruning can fragment the computations, leading to less efficient execution. The benefits of parallel processing might be reduced because the computations are no longer as well-structured or evenly distributed.
- Graph Rewriting Overheads: Pruning changes the graph of the model. Some inference engines might struggle with the altered graph, leading to inefficient execution plans or suboptimal use of the available hardware.
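A quick way to check whether pruning actually introduced enough sparsity to matter is to count the zero-valued weights stored in the exported ONNX file. This snippet is illustrative and was not part of the original experiments; the pruned model path is a placeholder:

# Measure the overall weight sparsity of an ONNX model by counting
# zero-valued entries in its initializers (the stored weight tensors).
# The pruned model filename below is hypothetical.
import numpy as np
import onnx
from onnx import numpy_helper

model = onnx.load("matxa_vocos_merged_HF_pruned.onnx")

zeros, total = 0, 0
for initializer in model.graph.initializer:
    weights = numpy_helper.to_array(initializer)
    zeros += int(np.count_nonzero(weights == 0))
    total += weights.size

print(f"Weight sparsity: {100.0 * zeros / total:.2f}% ({zeros}/{total} zero values)")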
Here’s the link for part 1:
And here’s the link for part 2:
All the resources developed within the Aina project (projecteaina.cat) are available in the Aina Kit (projecteaina.cat/ainakit), an open-source repository where users can access and download the generated technologies.
This work/research has been promoted and financed by the Government of Catalonia through the Aina project.