Simd swap

Rodrigo Muino Tomonari requested to merge github/fork/lucshi/simd-swap into main

Uses SIMD to improve buffer swap performance for the following APIs:

buffer.swap16()
buffer.swap32()
buffer.swap64()
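
For reference, a minimal sketch of what the three APIs do from the caller's side. Each one reverses the byte order in place within fixed-size elements (2, 4, or 8 bytes) and returns the same buffer. The variable names are illustrative only.

```javascript
// swap16: reverse bytes within each 16-bit element.
const b16 = Buffer.from([0x01, 0x02, 0x03, 0x04]);
b16.swap16();

// swap32: reverse bytes within each 32-bit element.
const b32 = Buffer.from([0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08]);
b32.swap32();

// swap64: reverse bytes within each 64-bit element.
const b64 = Buffer.from([0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08]);
b64.swap64();

console.log(b16.toString('hex')); // prints "02010403"
console.log(b32.toString('hex')); // prints "0403020108070605"
console.log(b64.toString('hex')); // prints "0807060504030201"
```

The buffer length must be a multiple of the element size, otherwise the call throws.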

By design, the Node.js buffer API dispatches only buffers over 128 bytes to the C++ implementation, so this optimization applies only to that case. It takes effect on x86 platforms that support AVX512 VBMI instructions, e.g. Cannon Lake, Ice Lake, and Ice Lake based AWS EC2 instances.

The code is placed in the deps/ folder because it is platform dependent.

To achieve the best performance, it uses the AVX512 permute instruction and inlines the SIMD code rather than calling a function, because the SIMD logic is too short to justify the call overhead.
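
The permute idea can be modelled in plain JavaScript. This is a sketch under stated assumptions, not the code in this merge request: the real implementation is C++ with AVX512 intrinsics, and `makeSwapIndices` and `permuteSwap` are hypothetical names. It assumes a byte-granular permute over a 64-byte (512-bit) register, where output byte `i` is read from input byte `table[i]`, and a scalar loop for any tail shorter than one register.

```javascript
const BLOCK = 64; // bytes in one 512-bit vector register

// Build the permute index table for a given element width (2, 4, or 8).
// Within each element the byte positions are mirrored.
function makeSwapIndices(width) {
  const table = new Uint8Array(BLOCK);
  for (let i = 0; i < BLOCK; i++) {
    const base = i - (i % width);                // first byte of this element
    table[i] = base + (width - 1 - (i % width)); // mirrored position inside it
  }
  return table;
}

// Swap bytes in place, one 64-byte block at a time (hypothetical helper).
function permuteSwap(buf, width) {
  if (buf.length % width !== 0) {
    throw new RangeError('length must be a multiple of the element width');
  }
  const table = makeSwapIndices(width);
  const block = new Uint8Array(BLOCK);
  let offset = 0;
  for (; offset + BLOCK <= buf.length; offset += BLOCK) {
    block.set(buf.subarray(offset, offset + BLOCK)); // models the vector load
    for (let i = 0; i < BLOCK; i++) {
      buf[offset + i] = block[table[i]];             // models permute + store
    }
  }
  // Scalar tail: reverse each remaining element in place.
  for (; offset < buf.length; offset += width) {
    buf.subarray(offset, offset + width).reverse();
  }
  return buf;
}

// Check the model against the built-in API on a 200-byte buffer.
const input = Buffer.from(Array.from({ length: 200 }, (_, i) => i));
const expected = Buffer.from(input).swap32();
const actual = permuteSwap(Buffer.from(input), 4);
console.log(actual.equals(expected)); // prints "true"
```

In hardware the inner loop over `i` is a single instruction per block, and the index table is a constant per element width, which is why the per-block logic is short enough that inlining matters.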

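The Node.js buffer swap benchmark shows large improvements with this optimization, ranging from ~30% to ~600% for buffers of 256 bytes and larger. The 64-byte cases are unchanged, as expected, since they do not reach the C++ path.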

node-benchmark-compare 
                                                                          confidence improvement accuracy (*)     (**)    (***)
buffers/buffer-swap.js n=1000000 len=1024 method='swap16' aligned='false'        ***     70.56 %       ±3.32%   ±4.56%   ±6.22%
buffers/buffer-swap.js n=1000000 len=1024 method='swap16' aligned='true'         ***     60.85 %       ±3.78%   ±5.19%   ±7.09%
buffers/buffer-swap.js n=1000000 len=1024 method='swap32' aligned='false'        ***    254.48 %       ±3.07%   ±4.40%   ±6.44%
buffers/buffer-swap.js n=1000000 len=1024 method='swap32' aligned='true'         ***    254.36 %       ±2.71%   ±3.88%   ±5.68%
buffers/buffer-swap.js n=1000000 len=1024 method='swap64' aligned='false'        ***    132.66 %       ±3.45%   ±4.95%   ±7.27%
buffers/buffer-swap.js n=1000000 len=1024 method='swap64' aligned='true'         ***    135.83 %       ±4.25%   ±6.09%   ±8.91%
buffers/buffer-swap.js n=1000000 len=2056 method='swap16' aligned='false'        ***     93.81 %       ±2.70%   ±3.71%   ±5.06%
buffers/buffer-swap.js n=1000000 len=2056 method='swap16' aligned='true'         ***     85.19 %       ±3.74%   ±5.26%   ±7.45%
buffers/buffer-swap.js n=1000000 len=2056 method='swap32' aligned='false'        ***    346.51 %       ±4.85%   ±6.90%   ±9.98%
buffers/buffer-swap.js n=1000000 len=2056 method='swap32' aligned='true'         ***    334.75 %       ±7.03%  ±10.10%  ±14.84%
buffers/buffer-swap.js n=1000000 len=2056 method='swap64' aligned='false'        ***    200.37 %       ±3.26%   ±4.68%   ±6.88%
buffers/buffer-swap.js n=1000000 len=2056 method='swap64' aligned='true'         ***    202.17 %       ±4.13%   ±5.94%   ±8.73%
buffers/buffer-swap.js n=1000000 len=256 method='swap16' aligned='false'         ***     32.23 %       ±6.39%   ±9.04%  ±12.94%
buffers/buffer-swap.js n=1000000 len=256 method='swap16' aligned='true'          ***     27.76 %       ±3.85%   ±5.31%   ±7.29%
buffers/buffer-swap.js n=1000000 len=256 method='swap32' aligned='false'         ***     82.32 %       ±4.38%   ±6.08%   ±8.44%
buffers/buffer-swap.js n=1000000 len=256 method='swap32' aligned='true'          ***     82.85 %      ±12.90%  ±18.01%  ±25.27%
buffers/buffer-swap.js n=1000000 len=256 method='swap64' aligned='false'         ***     42.91 %       ±2.00%   ±2.74%   ±3.75%
buffers/buffer-swap.js n=1000000 len=256 method='swap64' aligned='true'          ***     44.30 %       ±2.94%   ±4.03%   ±5.50%
buffers/buffer-swap.js n=1000000 len=64 method='swap16' aligned='false'                  -0.01 %       ±0.35%   ±0.48%   ±0.66%
buffers/buffer-swap.js n=1000000 len=64 method='swap16' aligned='true'                   -0.05 %       ±0.19%   ±0.26%   ±0.36%
buffers/buffer-swap.js n=1000000 len=64 method='swap32' aligned='false'                   0.38 %       ±1.76%   ±2.42%   ±3.30%
buffers/buffer-swap.js n=1000000 len=64 method='swap32' aligned='true'                    0.72 %       ±2.06%   ±2.82%   ±3.85%
buffers/buffer-swap.js n=1000000 len=64 method='swap64' aligned='false'                   0.44 %       ±1.00%   ±1.37%   ±1.86%
buffers/buffer-swap.js n=1000000 len=64 method='swap64' aligned='true'                   -0.36 %       ±0.48%   ±0.66%   ±0.90%
buffers/buffer-swap.js n=1000000 len=768 method='swap16' aligned='false'         ***     62.04 %       ±5.14%   ±7.21%  ±10.20%
buffers/buffer-swap.js n=1000000 len=768 method='swap16' aligned='true'          ***     50.11 %       ±4.66%   ±6.43%   ±8.86%
buffers/buffer-swap.js n=1000000 len=768 method='swap32' aligned='false'         ***    185.22 %       ±5.40%   ±7.75%  ±11.38%
buffers/buffer-swap.js n=1000000 len=768 method='swap32' aligned='true'          ***    186.12 %       ±2.80%   ±3.94%   ±5.59%
buffers/buffer-swap.js n=1000000 len=768 method='swap64' aligned='false'         ***    107.24 %       ±2.48%   ±3.56%   ±5.24%
buffers/buffer-swap.js n=1000000 len=768 method='swap64' aligned='true'          ***    109.04 %       ±5.19%   ±7.45%  ±10.96%
buffers/buffer-swap.js n=1000000 len=8192 method='swap16' aligned='false'        ***    104.20 %       ±2.08%   ±2.85%   ±3.89%
buffers/buffer-swap.js n=1000000 len=8192 method='swap16' aligned='true'         ***    176.63 %      ±58.59%  ±84.16% ±123.81%
buffers/buffer-swap.js n=1000000 len=8192 method='swap32' aligned='false'        ***    450.33 %       ±4.58%   ±6.57%   ±9.65%
buffers/buffer-swap.js n=1000000 len=8192 method='swap32' aligned='true'         ***    618.62 %     ±161.33% ±231.77% ±340.97%
buffers/buffer-swap.js n=1000000 len=8192 method='swap64' aligned='false'        ***    264.54 %       ±3.58%   ±5.15%   ±7.58%
buffers/buffer-swap.js n=1000000 len=8192 method='swap64' aligned='true'         ***    384.05 %     ±110.97% ±159.42% ±234.52%
