Rust, SIMD and target-feature flags
This year I've been working on concread, and one of the ways I have improved it is by using packed_simd for parallel key lookups in hashmaps. During testing I saw a ~10% speedup in Kanidm, which relies heavily on concread, so great, pack it up, go home.
...?
Or so I thought. Recently I was learning to use Ghidra with a friend, and as a thought exercise I wanted to see how Rust decompiled. I put the concread test suite into Ghidra and took a look. Looking at the version of concread with `simd_support` enabled, I saw this in the disassembly (truncated for readability).
**************************************************************
* FUNCTION *
**************************************************************
Simd<[packed_simd_2--masks--m64;8]> __stdcall eq(Simd<[p
...
100114510 55 PUSH RBP
100114511 48 89 e5 MOV RBP,RSP
100114514 48 83 e4 c0 AND RSP,-0x40
100114518 48 81 ec SUB RSP,0x100
00 01 00 00
10011451f 48 89 f8 MOV RAX,__return_storage_ptr__
100114522 0f 28 06 MOVAPS XMM0,xmmword ptr [self->__0.__0]
...
100114540 66 0f 76 c4 PCMPEQD XMM0,XMM4
100114544 66 0f 70 PSHUFD XMM4,XMM0,0xb1
e0 b1
100114549 66 0f db c4 PAND XMM0,XMM4
...
100114574 0f 29 9c MOVAPS xmmword ptr [RSP + local_90],XMM3
24 b0 00
00 00
1001145b4 48 89 7c MOV qword ptr [RSP + local_c8],__return_storage_pt
24 78
...
1001145be 0f 29 44 MOVAPS xmmword ptr [RSP + local_e0],XMM0
24 60
...
1001145d2 48 8b 44 MOV RAX,qword ptr [RSP + local_c8]
24 78
1001145d7 0f 28 44 MOVAPS XMM0,xmmword ptr [RSP + local_e0]
24 60
1001145dc 0f 29 00 MOVAPS xmmword ptr [RAX],XMM0
...
1001145ff 48 89 ec MOV RSP,RBP
100114602 5d POP RBP
100114603 c3 RET
Now, it's been a long time since I've had to look at x86_64 asm, so I saw this and went "great, it's not using a loop, those aren't simple `TEST/JNZ` instructions, they have a lot of letters, awesome, it's using HW accel".
Time passes ...
Coming back to this, I have been wondering how we could enable SIMD in concread at SUSE, since 389 Directory Server has just merged a change for 2.0.0 that uses concread as a cache. For this I needed to know the minimum CPU that SUSE supports. After some chasing internally to find that out, I asked in the Rust Brisbane group how to tell `packed_simd` to only emit instructions that work on a minimum CPU level, rather than on my CPU or the build host's CPU.
The response was "but that's already how it works".
I was helpfully directed to the packed_simd perf guide which discusses the use of target features and target cpu. At that point I realised that for this whole time I've only been using the default:
# rustc --print cfg | grep -i target_feature
target_feature="fxsr"
target_feature="sse"
target_feature="sse2"
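The same compile-time view is available from inside a program through the `cfg!` macro. This is a minimal sketch; the function name `compiled_features` and the short list of features it checks are chosen for illustration only.

```rust
/// Returns which of a few x86 features the compiler was allowed to use
/// for this build. The list checked here is illustrative, not exhaustive.
fn compiled_features() -> Vec<&'static str> {
    let mut enabled = Vec::new();
    // cfg! is evaluated at compile time, so this reflects the build flags,
    // not the CPU the binary ends up running on.
    if cfg!(target_feature = "sse2") {
        enabled.push("sse2");
    }
    if cfg!(target_feature = "avx") {
        enabled.push("avx");
    }
    if cfg!(target_feature = "avx2") {
        enabled.push("avx2");
    }
    enabled
}

fn main() {
    println!("compiled with: {:?}", compiled_features());
}
```

Building this with and without `-C target-cpu=native` should show the same difference as the `rustc --print cfg` output.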
The `PCMPEQD` instruction is from SSE2, but my CPU is much newer and should support AVX and AVX2. Retesting this, I can see my CPU has much more:
# rustc --print cfg -C target-cpu=native | grep -i target_feature
target_feature="aes"
target_feature="avx"
target_feature="avx2"
target_feature="bmi1"
target_feature="bmi2"
target_feature="fma"
target_feature="fxsr"
target_feature="lzcnt"
target_feature="pclmulqdq"
target_feature="popcnt"
target_feature="rdrand"
target_feature="rdseed"
target_feature="sse"
target_feature="sse2"
target_feature="sse3"
target_feature="sse4.1"
target_feature="sse4.2"
target_feature="ssse3"
target_feature="xsave"
target_feature="xsavec"
target_feature="xsaveopt"
target_feature="xsaves"
All this time, I haven't been using my native features!
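What the CPU supports, as opposed to what the compiler was told to target, can also be queried at run time with the standard library's `is_x86_feature_detected!` macro. The sketch below is only an illustration; `detected_features` is a hypothetical helper name, and the features it probes are a small hand-picked subset.

```rust
/// Features reported by the CPU this binary is currently running on.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn detected_features() -> Vec<&'static str> {
    let mut found = Vec::new();
    // Each check asks the CPU at run time, regardless of the build flags.
    if is_x86_feature_detected!("sse2") {
        found.push("sse2");
    }
    if is_x86_feature_detected!("avx") {
        found.push("avx");
    }
    if is_x86_feature_detected!("avx2") {
        found.push("avx2");
    }
    found
}

/// On other architectures there is nothing x86-specific to report.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn detected_features() -> Vec<&'static str> {
    Vec::new()
}

fn main() {
    let cpu = detected_features();
    println!("cpu supports: {:?}", cpu);
    // A feature the CPU has but the build did not enable goes unused.
    if cpu.contains(&"avx2") && !cfg!(target_feature = "avx2") {
        println!("this build is not using AVX2 even though the CPU has it");
    }
}
```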
For local builds now, I have .cargo/config set with:
[build]
rustflags = "-C target-cpu=native"
I recompiled concread and I now see in Ghidra:
00198960 55 PUSH RBP
00198961 48 89 e5 MOV RBP,RSP
00198964 48 83 e4 c0 AND RSP,-0x40
00198968 48 81 ec SUB RSP,0x100
00 01 00 00
0019896f 48 89 f8 MOV RAX,__return_storage_ptr__
00198972 c5 fc 28 06 VMOVAPS YMM0,ymmword ptr [self->__0.__0]
00198976 c5 fc 28 VMOVAPS YMM1,ymmword ptr [RSI + self->__0.__4]
4e 20
0019897b c5 fc 28 12 VMOVAPS YMM2,ymmword ptr [other->__0.__0]
0019897f c5 fc 28 VMOVAPS YMM3,ymmword ptr [RDX + other->__0.__4]
5a 20
00198984 c4 e2 7d VPCMPEQQ YMM0,YMM0,YMM2
29 c2
00198989 c4 e2 75 VPCMPEQQ YMM1,YMM1,YMM3
29 cb
0019898e c5 fc 29 VMOVAPS ymmword ptr [RSP + local_a0[0]],YMM1
8c 24 a0
00 00 00
...
001989e7 48 89 ec MOV RSP,RBP
001989ea 5d POP RBP
001989eb c5 f8 77 VZEROUPPER
001989ee c3 RET
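`VPCMPEQQ` is the AVX2 64-bit integer compare instruction. You can tell it's AVX2 because it is an integer operation on the 256-bit YMM registers: SSE2 only has the 128-bit XMM registers, and AVX introduced the YMM registers but mostly for floating-point operations, with 256-bit integer comparisons such as this one arriving in AVX2. This means I'm now getting the SIMD comparisons I wanted!
These can be enabled with `RUSTFLAGS='-C target-feature=+avx2,+avx'` for selected builds, or in your .cargo/config. Using `target-cpu=native` is probably a good idea for local development only, since a binary built that way may not run on older CPUs.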
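When a binary has to run on machines with unknown CPUs, an alternative to the global flags is to enable a feature for a single function and pick the implementation at run time. The following is a minimal sketch using the `std::arch` intrinsics directly; it is not packed_simd and not how concread does it, and the names `eq_mask`, `eq_mask_avx2` and `eq_mask_scalar` are hypothetical.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Portable fallback: bit i of the result is set when a[i] == b[i].
fn eq_mask_scalar(a: &[u64; 4], b: &[u64; 4]) -> u8 {
    let mut mask = 0u8;
    for i in 0..4 {
        if a[i] == b[i] {
            mask |= 1 << i;
        }
    }
    mask
}

/// AVX2 version: the compiler may use AVX2 inside this function only,
/// whatever target features the rest of the crate was built with.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn eq_mask_avx2(a: &[u64; 4], b: &[u64; 4]) -> u8 {
    // Unaligned loads of four 64-bit lanes each.
    let va = _mm256_loadu_si256(a.as_ptr() as *const __m256i);
    let vb = _mm256_loadu_si256(b.as_ptr() as *const __m256i);
    // Lane-wise 64-bit equality; equal lanes become all ones.
    let eq = _mm256_cmpeq_epi64(va, vb);
    // Collect the top bit of each 64-bit lane into the low 4 bits.
    _mm256_movemask_pd(_mm256_castsi256_pd(eq)) as u8
}

/// Chooses the AVX2 path only when the running CPU reports support for it.
fn eq_mask(a: &[u64; 4], b: &[u64; 4]) -> u8 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safety: AVX2 support was checked on the line above.
            return unsafe { eq_mask_avx2(a, b) };
        }
    }
    eq_mask_scalar(a, b)
}

fn main() {
    let mask = eq_mask(&[1, 2, 3, 4], &[1, 0, 3, 0]);
    println!("equal lanes: {:04b}", mask);
}
```

The trade-off is a run-time check per call and an `unsafe` boundary, in exchange for one binary that works on both old and new CPUs.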