Commit c4f4b53e authored Jun 27, 2022 by Danila Kutenin Committed by Wilco Dijkstra Apr 09, 2024

aarch64: Optimize string functions with shrn instruction

We found that the string functions were using AND+ADDP
to compute the nibble/syndrome mask, but there is a simpler
way: `SHRN dst.8b, src.8h, 4` (shift each 16-bit element
right by 4 and narrow it to 1 byte), which has the
same latency as ADDP on all SIMD ARMv8 targets. There
may be similar opportunities in memcmp, but that's for
another patch.
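
To illustrate why the two sequences are interchangeable, here is a minimal scalar sketch in portable C that emulates both on a 16-byte comparison result. It is not the code from this patch: the function names are hypothetical, and the input is assumed to be a CMEQ-style vector in which each byte is 0xff for a match and 0x00 otherwise. Both functions return a 64-bit mask with one nibble per input byte.

```c
#include <stdint.h>

/* cmp[i] is assumed to be 0xff where byte i matched and 0x00 otherwise,
   as a NEON CMEQ on a 16-byte chunk would produce. */

/* Emulates the old sequence: AND with the repeating byte mask
   {0x0f, 0xf0}, then ADDP to add each pair of adjacent bytes. */
uint64_t nibble_mask_and_addp(const uint8_t cmp[16])
{
    uint64_t mask = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t lo = cmp[2 * i] & 0x0f;      /* even byte keeps low nibble */
        uint8_t hi = cmp[2 * i + 1] & 0xf0;  /* odd byte keeps high nibble */
        mask |= (uint64_t)(uint8_t)(lo + hi) << (8 * i);
    }
    return mask;
}

/* Emulates SHRN dst.8b, src.8h, #4: view each byte pair as a
   little-endian 16-bit element, shift right by 4, keep the low 8 bits. */
uint64_t nibble_mask_shrn(const uint8_t cmp[16])
{
    uint64_t mask = 0;
    for (int i = 0; i < 8; i++) {
        uint16_t h = (uint16_t)(cmp[2 * i] | (cmp[2 * i + 1] << 8));
        mask |= (uint64_t)(uint8_t)(h >> 4) << (8 * i);
    }
    return mask;
}

/* Index of the first matching byte (one nibble per byte), or 16 if none. */
int first_match_index(uint64_t mask)
{
    for (int i = 0; i < 16; i++)
        if ((mask >> (4 * i)) & 0xf)
            return i;
    return 16;
}
```

For any input whose bytes are all 0x00 or 0xff, both functions return the same mask, so the single SHRN can stand in for the AND+ADDP pair without changing how the mask is consumed afterwards.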

We see 10-20% savings for small to mid-size inputs (<=128 bytes),
which are the primary cases for general workloads.

(cherry picked from commit 3c998069)

parent 3393c72e
