bpf, arm64: optimize 32/64 immediate emission
Improve the JIT to emit 64 and 32 bit immediates, the current algorithm is not optimal and we often emit more instructions than actually needed. arm64 has movz, movn, movk variants but for the current 64 bit immediates we only use movz with a series of movk when needed. For example loading ffffffffffffabab emits the following 4 instructions in the JIT today: * movz: abab, shift: 0, result: 000000000000abab * movk: ffff, shift: 16, result: 00000000ffffabab * movk: ffff, shift: 32, result: 0000ffffffffabab * movk: ffff, shift: 48, result: ffffffffffffabab Whereas after the patch the same load only needs a single instruction: * movn: 5454, shift: 0, result: ffffffffffffabab Another example where two extra instructions can be saved: * movz: abab, shift: 0, result: 000000000000abab * movk: 1f2f, shift: 16, result: 000000001f2fabab * movk: ffff, shift: 32, result: 0000ffff1f2fabab * movk: ffff, shift: 48, result: ffffffff1f2fabab After the patch: * movn: e0d0, shift: 16, result: ffffffff1f2fffff * movk: abab, shift: 0, result: ffffffff1f2fabab Another example with movz, before: * movz: 0000, shift: 0, result: 0000000000000000 * movk: fea0, shift: 32, result: 0000fea000000000 After: * movz: fea0, shift: 32, result: 0000fea000000000 Moreover, reuse emit_a64_mov_i() for 32 bit immediates that are loaded via emit_a64_mov_i64() which is a similar optimization as done in 6fe8b9c1 ("bpf, x64: save several bytes by using mov over movabsq when possible"). On arm64, the latter allows to use a single instruction with movn due to zero extension where otherwise two would be needed. And last but not least add a missing optimization in emit_a64_mov_i() where movn is used but the subsequent movk not needed. With some of the Cilium programs in use, this shrinks the needed instructions by about three percent. Tested on Cavium ThunderX CN8890. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Please register or sign in to comment