In my previous blog post, I began exploring the xz utility. From my analysis of the perf report, I found that
bt_find_function accounts for the largest share of the samples, almost 40%. Inside the function itself, the hotspot seems to be a
cmp instruction that checks whether a register is 0.
My first idea is to try using a bitwise XOR instead of the subtraction that cmp performs.
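To make the idea concrete: cmp sets the flags from a subtraction, and an equality test can be phrased with XOR instead. Below is a minimal sketch in C. The function names are hypothetical and not taken from xz, and an optimizing compiler may well emit the same instructions for both versions, so this only illustrates the equivalence, not a speedup.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: equality phrased as "difference is zero". */
static bool equal_by_sub(uint64_t a, uint64_t b)
{
    return (a - b) == 0;
}

/* Hypothetical helper: equality phrased as "no bit differs". */
static bool equal_by_xor(uint64_t a, uint64_t b)
{
    return (a ^ b) == 0;
}
```

Both functions return true exactly when a and b are equal, so swapping one for the other cannot change the result, only (possibly) the generated code.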
I had contacted the main developer of this project, Lasse Collin, for ideas on what to optimize. He suggested looking into the code in the
memcmplen file. It is optimized on x86 to use unaligned 8-byte accesses, while other architectures use 4-byte accesses. He advised looking into this function, since improving it could have an impact on compression performance.
The file mentioned above provides a function that compares two given buffers and returns, as a
uint32_t, the number of bytes that match. The returned value always lies between the number of bytes already compared and known to match, and the limit up to which the buffers are to be compared.
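The behaviour described above can be sketched as a plain byte-by-byte loop. This is my own reference version, not the code from xz: the name match_len_bytewise and the parameter names are assumptions made for illustration.

```c
#include <stdint.h>

/*
 * Hypothetical reference version (not the xz implementation).
 * len   = number of leading bytes already known to match
 * limit = number of bytes up to which to compare
 * Assumes len <= limit and that both buffers hold at least limit bytes.
 */
static uint32_t
match_len_bytewise(const uint8_t *buf1, const uint8_t *buf2,
        uint32_t len, uint32_t limit)
{
    /* Advance while there is room and the bytes still agree. */
    while (len < limit && buf1[len] == buf2[len])
        ++len;

    /* The result is always in the range [len at entry, limit]. */
    return len;
}
```

An optimized version has to return the same values as this loop; it only gets there faster by comparing several bytes per step.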
A first glance at the code suggests that intrinsics are used to carry out all sorts of operations such as load, store, add, etc. My goal is to add an
#elif directive that is true on ARM platforms. I will be following the guides mentioned below to understand the existing x86 code and to find equivalent or improved intrinsics for AArch64 SIMD.
X86 Built-in Functions - Using the GNU Compiler Collection (GCC) (gcc.gnu.org)
ARM NEON Intrinsics - Using the GNU Compiler Collection (GCC) (gcc.gnu.org)
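To picture what the extra preprocessor branch could look like, here is a rough sketch under my own assumptions. It is not the xz code and it does not use NEON intrinsics yet: it simply enables an 8-byte-at-a-time comparison on AArch64 as well as on x86-64. The names SKETCH_WIDE_COMPARE, load64 and match_len_sketch are hypothetical, the wide path assumes a little-endian target and a GCC-compatible compiler (for __builtin_ctzll), and it only loads 8 bytes when that many remain before the limit.

```c
#include <stdint.h>
#include <string.h>

/* Decide at compile time whether the 8-byte path is used. */
#if defined(__x86_64__) && defined(__GNUC__)
#   define SKETCH_WIDE_COMPARE 1
#elif defined(__aarch64__) && defined(__GNUC__) \
        && defined(__BYTE_ORDER__) \
        && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    /* The new branch: little-endian AArch64 takes the same path. */
#   define SKETCH_WIDE_COMPARE 1
#else
#   define SKETCH_WIDE_COMPARE 0
#endif

/* Hypothetical helper: load 8 bytes from a possibly unaligned address. */
static inline uint64_t load64(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof(v));
    return v;
}

/* Assumes len <= limit and that both buffers hold at least limit bytes. */
static uint32_t
match_len_sketch(const uint8_t *buf1, const uint8_t *buf2,
        uint32_t len, uint32_t limit)
{
#if SKETCH_WIDE_COMPARE
    while (limit - len >= 8) {
        /* A nonzero XOR means at least one of the 8 bytes differs. */
        const uint64_t diff = load64(buf1 + len) ^ load64(buf2 + len);
        if (diff != 0) {
            /* On little-endian, the lowest set bit belongs to the
             * first differing byte; bits / 8 = matching bytes. */
            return len + (uint32_t)(__builtin_ctzll(diff) >> 3);
        }
        len += 8;
    }
#endif
    /* Tail (or the whole range on other targets): byte by byte. */
    while (len < limit && buf1[len] == buf2[len])
        ++len;
    return len;
}
```

On targets that match neither branch, the function falls back to the byte loop, so the result is the same everywhere and only the speed differs.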
So there you have it, my plan. Basically, I have two ideas. One seems a bit silly, but it is based on my beginner’s knowledge of processors and assembly code, and it comes out of my benchmarking and profiling results. The other is to do what the person who knows this code inside out suggested.