在 64bit CPU 上計算 16bit 數字的 bit 數

fcamel

Published in

fcamel的程式開發心得

5 min readOct 15, 2016

最近看到《Google’s “Director of Engineering” Hiring Test》，覺得作者 Pierre Gauthier 的技術能力超標於這個初階測驗。從他的回答學到了一些東西，這篇針對第九題作個簡單的測試。

題目：There’s an array of 10,000 16-bit values, how do you count the bits most efficiently?

官方答案是查表法，用空間換取時間的技巧。想想這應該就是最佳解了吧，從演算法的角度來說已是 O(1) 了。

Pierre Gauthier 卻回答：

Me: there are faster methods processing 64-bit words with masks but I can’t explain it over the phone, I must write code.
Recruiter: the correct answer is to use a lookup table and then sum the results.
Me: on which kind of CPU? Why not let me compare my code to yours in a benchmark?

看到他的回答，才想起以前有讀過有位元計算的技巧可以快速算出 bit 數。一次取 4 個 16bit 數字當作一個 64bit 數字，再用特殊技巧算出 bit 數，的確有可能更快。

好奇兩者的效率差異，就依兩個解法的概念各寫了一份實作，原始碼和測試程式在這裡。下面是測試結果。

iMac 4 GHz Intel Core i7

Ubuntu VM (Parallel) on iMac

比較結果

取 4 個 16bit 數字當作一個 64bit 數字用 mask 計算的確比較快，而且在 g++ -O3 on Linux 的情況快了 ~2.5 倍！之後再找時間看編出來的組語，看看為什麼會有如此差異。

另外 Parallel 效能很好，VM 上的執行速度和 Host 沒什麼差。

其它

Pierre Gauthier 針對 quick sort 的看法也很有趣：

Me: “big-O” ignores data storage latencies, topology, volume, available memory, and even the computational cost of every CPU instructions involved in a given implementation — instead, it merely counts the number of algorithmic operations! Big-O can be a valuable indication when designing algorithms but the best performing and scaling solution depends on the particular constraints of any specific problem and environment.
…
The Linux kernel (that Google relies on) opted for Heapsort rather than Quicksort… for its lower memory usage and more predictable execution time.

這兩個例子提醒了我：演算法和效率分析讓我們可以避免使用明顯錯誤的方法，但實務上仍有許多細節會影響時間和空間的效率，需要多留意。

2016–10–16 Update

1. 從 Po-Wei Wang 那得知《SSSE3: fast popcount》，有更完整的比較。抓了程式在我的環境跑，其中 builtin-popcnt-unrolled-errata-manual 的效率最好，所以就將這部份實作加到我的測試裡。這個作法直接用組語呼叫 SSE4 的 popcnt。在 VM 或 Host 上跑，執行速度沒有差異。下面是在 Ubuntu VM 執行的結果：

用 SSE4 的指令 popcnt 竟然快了 ~8.5 倍，真是驚人。硬體加速才是王道啊！

2. Wens 提到 ARM 也有類似的指令，不過我懶得在 ARM 的環境測試 …。

3. 聊到關於 Pierre Gauthier 對 Big-O 的看法，Scott 提到 “Latency Numbers Every Programmer Should Know Raw”。雖然看過這個表好幾次，還是沒有印到腦子裡。藉這次的機會，有了比較深的體會。當演算法的效率到理論最佳值時，就得留意每步操作在現行計算機架構下的執行時間。即使指令數比較多，若能藉此減少 memory load/store 的次數，有可能更省時間。

4. 我後來又補上暴力法和用 8-bit 數字建表的實作。藉此觀察 memory load 和 cache miss 的影響，結果算是符合推測。下面是在 Ubuntu VM 的執行：

單從演算法效率來說， count_directly 是 count_by_table 「16倍」慢，但結果是 ~5.5 倍慢。應該是因為 count_directly 少了一次 memory load，而 memory_load 的速度比讀寫 register 慢不少，所以雖然是 16倍的操作，速度不是 16 倍慢。

count_by_table8 是用 8-bit 數字建表，查表次數是 count_by_table (用16-bit 數字的表) 的兩倍，但 count_by_table8 的表只有 256 個數字，比較不容易 cache miss，所以執行時間差不多，沒有慢兩倍。

count_by_bit_operation 編出來的組語多了相當多指令，但是它沒有查表。換句話說，處理 4 個 16-bit 數字時，少了 4 次 memory load。這應該是它之所以快的關鍵。