Rust, SIMD and target-feature flags

Rust, SIMD and target-feature flags

This year I've been working on concread and one of the ways that I have improved it is through the use of packed_simd for parallel key lookups in hashmaps. During testing I saw a ~10% speed up in Kanidm which heavily relies on concread, so great, pack it up, go home.

...?

Or so I thought. Recently I was learning to use Ghidra with a friend, and as a thought exercise I wanted to see how Rust decompiled. I put the concread test suite into Ghidra and took a look. Looking at the version of concread with [simd_support]{.title-ref} enabled, I saw this in the disassembly (truncated for readability).

**************************************************************
*                          FUNCTION                          *
**************************************************************
Simd<[packed_simd_2--masks--m64;8]> __stdcall eq(Simd<[p
...
100114510 55              PUSH       RBP
100114511 48 89 e5        MOV        RBP,RSP
100114514 48 83 e4 c0     AND        RSP,-0x40
100114518 48 81 ec        SUB        RSP,0x100
        00 01 00 00
10011451f 48 89 f8        MOV        RAX,__return_storage_ptr__
100114522 0f 28 06        MOVAPS     XMM0,xmmword ptr [self->__0.__0]
...
100114540 66 0f 76 c4     PCMPEQD    XMM0,XMM4
100114544 66 0f 70        PSHUFD     XMM4,XMM0,0xb1
        e0 b1
100114549 66 0f db c4     PAND       XMM0,XMM4
...
100114574 0f 29 9c        MOVAPS     xmmword ptr [RSP + local_90],XMM3
        24 b0 00 
        00 00
1001145b4 48 89 7c        MOV        qword ptr [RSP + local_c8],__return_storage_pt
        24 78
...
1001145be 0f 29 44        MOVAPS     xmmword ptr [RSP + local_e0],XMM0
        24 60
...
1001145d2 48 8b 44        MOV        RAX,qword ptr [RSP + local_c8]
        24 78
1001145d7 0f 28 44        MOVAPS     XMM0,xmmword ptr [RSP + local_e0]
        24 60
1001145dc 0f 29 00        MOVAPS     xmmword ptr [RAX],XMM0
...
1001145ff 48 89 ec        MOV        RSP,RBP
100114602 5d              POP        RBP
100114603 c3              RET

Now, it's been a long time since I've had to look at x86_64 asm, so I saw this and went "great, it's not using a loop, those aren't simple [TEST/JNZ]{.title-ref} instructions, they have a lot of letters, awesome it's using HW accel.

Time passes ...

Coming back to this, I have been wondering how we could enable SIMD in concread at SUSE, since 389 Directory Server has just merged a change for 2.0.0 that uses concread as a cache. For this I needed to know what minimum CPU is supported at SUSE. After some chasing internally, knowing what we need I asked in the Rust Brisbane group about how you can define in [packed_simd]{.title-ref} to only emit instructions that work on a minimum CPU level rather than my cpu or the builder cpu.

The response was "but that's already how it works".

I was helpfully directed to the packed_simd perf guide which discusses the use of target features and target cpu. At that point I realised that for this whole time I've only been using the default:

# rustc --print cfg | grep -i target_feature
target_feature="fxsr"
target_feature="sse"
target_feature="sse2"

The [PCMPEQD]{.title-ref} is from sse2, but my cpu is much newer and should support AVX and AVX2. Retesting this, I can see my CPU has much more:

# rustc --print cfg -C target-cpu=native | grep -i target_feature
target_feature="aes"
target_feature="avx"
target_feature="avx2"
target_feature="bmi1"
target_feature="bmi2"
target_feature="fma"
target_feature="fxsr"
target_feature="lzcnt"
target_feature="pclmulqdq"
target_feature="popcnt"
target_feature="rdrand"
target_feature="rdseed"
target_feature="sse"
target_feature="sse2"
target_feature="sse3"
target_feature="sse4.1"
target_feature="sse4.2"
target_feature="ssse3"
target_feature="xsave"
target_feature="xsavec"
target_feature="xsaveopt"
target_feature="xsaves"

All this time, I haven't been using my native features!

For local builds now, I have .cargo/config set with:

[build]
rustflags = "-C target-cpu=native"

I recompiled concread and I now see in Ghidra:

00198960 55              PUSH       RBP
00198961 48 89 e5        MOV        RBP,RSP
00198964 48 83 e4 c0     AND        RSP,-0x40
00198968 48 81 ec        SUB        RSP,0x100
         00 01 00 00
0019896f 48 89 f8        MOV        RAX,__return_storage_ptr__
00198972 c5 fc 28 06     VMOVAPS    YMM0,ymmword ptr [self->__0.__0]
00198976 c5 fc 28        VMOVAPS    YMM1,ymmword ptr [RSI + self->__0.__4]
         4e 20
0019897b c5 fc 28 12     VMOVAPS    YMM2,ymmword ptr [other->__0.__0]
0019897f c5 fc 28        VMOVAPS    YMM3,ymmword ptr [RDX + other->__0.__4]
         5a 20
00198984 c4 e2 7d        VPCMPEQQ   YMM0,YMM0,YMM2
         29 c2
00198989 c4 e2 75        VPCMPEQQ   YMM1,YMM1,YMM3
         29 cb
0019898e c5 fc 29        VMOVAPS    ymmword ptr [RSP + local_a0[0]],YMM1
         8c 24 a0 
         00 00 00
...
001989e7 48 89 ec        MOV        RSP,RBP
001989ea 5d              POP        RBP
001989eb c5 f8 77        VZEROUPPER
001989ee c3              RET

[VPCMPEQQ]{.title-ref} is the AVX2 compare instruction (You can tell it's AVX2 due to the register YMM, AVX uses XMM). Which means now I'm getting the SIMD comparisons I wanted!

These can be enabled with [RUSTFLAGS='-C target-feature=+avx2,+avx']{.title-ref} for selected builds, or in your .cargo/config. It may be a good idea for just local development to do [target-cpu=native]{.title-ref}.