Rust, SIMD and target-feature flags
This year I’ve been working on concread, and one of the ways I have improved it is by using packed_simd for parallel key lookups in hashmaps. During testing I saw a ~10% speedup in Kanidm, which relies heavily on concread, so great, pack it up, go home.
…?
Or so I thought. Recently I was learning to use Ghidra with a friend, and as a thought exercise I wanted to see how Rust decompiled. I put the concread test suite into Ghidra and took a look. Looking at the version of concread with simd_support enabled, I saw this in the disassembly (truncated for readability).
**************************************************************
* FUNCTION *
**************************************************************
Simd<[packed_simd_2--masks--m64;8]> __stdcall eq(Simd<[p
...
100114510 55 PUSH RBP
100114511 48 89 e5 MOV RBP,RSP
100114514 48 83 e4 c0 AND RSP,-0x40
100114518 48 81 ec SUB RSP,0x100
00 01 00 00
10011451f 48 89 f8 MOV RAX,__return_storage_ptr__
100114522 0f 28 06 MOVAPS XMM0,xmmword ptr [self->__0.__0]
...
100114540 66 0f 76 c4 PCMPEQD XMM0,XMM4
100114544 66 0f 70 PSHUFD XMM4,XMM0,0xb1
e0 b1
100114549 66 0f db c4 PAND XMM0,XMM4
...
100114574 0f 29 9c MOVAPS xmmword ptr [RSP + local_90],XMM3
24 b0 00
00 00
1001145b4 48 89 7c MOV qword ptr [RSP + local_c8],__return_storage_pt
24 78
...
1001145be 0f 29 44 MOVAPS xmmword ptr [RSP + local_e0],XMM0
24 60
...
1001145d2 48 8b 44 MOV RAX,qword ptr [RSP + local_c8]
24 78
1001145d7 0f 28 44 MOVAPS XMM0,xmmword ptr [RSP + local_e0]
24 60
1001145dc 0f 29 00 MOVAPS xmmword ptr [RAX],XMM0
...
1001145ff 48 89 ec MOV RSP,RBP
100114602 5d POP RBP
100114603 c3 RET
Now, it’s been a long time since I’ve had to look at x86_64 asm, so I saw this and went “great, it’s not using a loop, those aren’t simple TEST/JNZ instructions, they have a lot of letters, awesome, it’s using HW accel”.
Time passes …
Coming back to this, I have been wondering how we could enable SIMD in concread at SUSE, since 389 Directory Server has just merged a change for 2.0.0 that uses concread as a cache. For this I needed to know the minimum CPU that SUSE supports. After some chasing internally I found out what we need, so I asked in the Rust Brisbane group how to get packed_simd to emit only instructions that work on a minimum CPU level, rather than on my CPU or the builder’s CPU.
The response was “but that’s already how it works”.
I was helpfully directed to the packed_simd perf guide, which discusses the use of target features and target cpu. At that point I realised that this whole time I’ve only been using the default target features:
# rustc --print cfg | grep -i target_feature
target_feature="fxsr"
target_feature="sse"
target_feature="sse2"
The PCMPEQD is from SSE2, but my CPU is much newer and should support AVX and AVX2. Asking rustc again with target-cpu=native, I can see my CPU has much more:
# rustc --print cfg -C target-cpu=native | grep -i target_feature
target_feature="aes"
target_feature="avx"
target_feature="avx2"
target_feature="bmi1"
target_feature="bmi2"
target_feature="fma"
target_feature="fxsr"
target_feature="lzcnt"
target_feature="pclmulqdq"
target_feature="popcnt"
target_feature="rdrand"
target_feature="rdseed"
target_feature="sse"
target_feature="sse2"
target_feature="sse3"
target_feature="sse4.1"
target_feature="sse4.2"
target_feature="ssse3"
target_feature="xsave"
target_feature="xsavec"
target_feature="xsaveopt"
target_feature="xsaves"
All this time, I haven’t been using my native features!
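The same thing can be checked from inside a program. This is only a small sketch (the function name compiled_features is made up for illustration, it is not part of concread or packed_simd) that uses the standard cfg! macro to report which features the binary was compiled with. Note that it reflects the flags given to rustc, not the CPU the binary happens to run on.

```rust
/// Report whether a few x86 SIMD features were enabled at compile time.
/// `compiled_features` is a hypothetical helper name used for illustration.
fn compiled_features() -> Vec<(&'static str, bool)> {
    vec![
        // cfg! is evaluated by the compiler, so each entry reflects the
        // target features rustc was told to use, not the running CPU.
        ("sse2", cfg!(target_feature = "sse2")),
        ("avx", cfg!(target_feature = "avx")),
        ("avx2", cfg!(target_feature = "avx2")),
    ]
}

fn main() {
    for (name, enabled) in compiled_features() {
        let state = if enabled { "enabled" } else { "not enabled" };
        println!("{}: {}", name, state);
    }
}
```

With the default flags on x86_64 I would expect only sse2 to show as enabled, matching the first rustc --print cfg output above.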
For local builds now, I have .cargo/config set with:
[build]
rustflags = "-C target-cpu=native"
I recompiled concread and I now see in Ghidra:
00198960 55 PUSH RBP
00198961 48 89 e5 MOV RBP,RSP
00198964 48 83 e4 c0 AND RSP,-0x40
00198968 48 81 ec SUB RSP,0x100
00 01 00 00
0019896f 48 89 f8 MOV RAX,__return_storage_ptr__
00198972 c5 fc 28 06 VMOVAPS YMM0,ymmword ptr [self->__0.__0]
00198976 c5 fc 28 VMOVAPS YMM1,ymmword ptr [RSI + self->__0.__4]
4e 20
0019897b c5 fc 28 12 VMOVAPS YMM2,ymmword ptr [other->__0.__0]
0019897f c5 fc 28 VMOVAPS YMM3,ymmword ptr [RDX + other->__0.__4]
5a 20
00198984 c4 e2 7d VPCMPEQQ YMM0,YMM0,YMM2
29 c2
00198989 c4 e2 75 VPCMPEQQ YMM1,YMM1,YMM3
29 cb
0019898e c5 fc 29 VMOVAPS ymmword ptr [RSP + local_a0[0]],YMM1
8c 24 a0
00 00 00
...
001989e7 48 89 ec MOV RSP,RBP
001989ea 5d POP RBP
001989eb c5 f8 77 VZEROUPPER
001989ee c3 RET
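VPCMPEQQ is the AVX2 compare instruction. You can tell it’s AVX2 because it is an integer compare on the 256-bit YMM registers: SSE uses the 128-bit XMM registers, and while AVX introduced the YMM registers, it was mostly for floating point operations, with 256-bit integer compares like this one arriving in AVX2. Which means now I’m getting the wider SIMD comparisons I wanted!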
AVX and AVX2 can be enabled with RUSTFLAGS='-C target-feature=+avx2,+avx' for selected builds, or in your .cargo/config. For local development only, target-cpu=native may be a good idea, keeping in mind that the resulting binary may not run on CPUs that lack those features.
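For reference, the eq function in the listings above compares eight 64-bit lanes and produces a mask. Below is a scalar sketch of that operation in plain Rust. It is not the packed_simd implementation or concread’s code, and the name eq_lanes is hypothetical. It only shows what is being computed. Since a YMM register holds four 64-bit lanes, two VPCMPEQQ instructions cover all eight, which is what the AVX2 listing shows.

```rust
/// Compare two groups of eight 64-bit keys lane by lane.
/// Each output lane is true where the corresponding inputs are equal.
/// `eq_lanes` is a hypothetical name; this is a scalar illustration only.
fn eq_lanes(a: &[u64; 8], b: &[u64; 8]) -> [bool; 8] {
    let mut mask = [false; 8];
    for i in 0..8 {
        mask[i] = a[i] == b[i];
    }
    mask
}

fn main() {
    // Eight stored keys, and a probe key repeated in every lane.
    let keys: [u64; 8] = [1, 2, 3, 4, 5, 6, 7, 8];
    let probe: [u64; 8] = [3; 8];
    let mask = eq_lanes(&keys, &probe);
    println!("{:?}", mask);
}
```

Only the third lane matches here, so that is the only true entry in the mask.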