Commit 75c1aee5 authored by Anton Youdkevitch, committed by Steve Ellcey

aarch64: optimized memcpy implementation for thunderx2

Since aligned loads and stores are a huge performance
advantage, the implementation always tries to use aligned
accesses. Besides the cases where the src and dst addresses
are both aligned or are misaligned by the same amount, there
are cases where src and dst are misaligned by different
amounts. For such cases (if the length is large enough) the
ext instruction is used to merge-and-shift two memory chunks
loaded from two adjacent aligned locations, and the adjusted
chunk is then stored to an aligned address.
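
The merge-and-shift idea can be illustrated with a minimal portable C
sketch. It is not the assembly from this commit: ext_merge and
copy_shifted are hypothetical names, ext_merge only emulates what the
ext instruction does on two 16-byte registers, and the sketch assumes
the length is a multiple of 16 and that one extra 16-byte chunk past
the copied range is readable (a real implementation would handle the
head and tail separately).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK 16  /* size of one 128-bit vector register */

/* Portable stand-in for the AArch64 ext instruction (hypothetical helper):
   out receives bytes shift .. shift+15 of the 32-byte concatenation lo:hi.
   out must not overlap lo or hi. */
void ext_merge(uint8_t out[CHUNK], const uint8_t lo[CHUNK],
               const uint8_t hi[CHUNK], unsigned shift)
{
    assert(shift < CHUNK);
    memcpy(out, lo + shift, CHUNK - shift);    /* tail of the low chunk  */
    memcpy(out + (CHUNK - shift), hi, shift);  /* head of the high chunk */
}

/* Copy n bytes starting at src_base + shift to dst, where src_base and dst
   stand for 16-byte-aligned addresses. Every load below starts at a
   multiple of CHUNK from src_base and every store at a multiple of CHUNK
   from dst, so only the merge step deals with the misalignment.
   Assumptions: n is a multiple of CHUNK and src_base is readable for
   n + CHUNK bytes. */
void copy_shifted(uint8_t *dst, const uint8_t *src_base,
                  unsigned shift, size_t n)
{
    uint8_t lo[CHUNK], hi[CHUNK], merged[CHUNK];

    assert(shift < CHUNK && n % CHUNK == 0);

    memcpy(lo, src_base, CHUNK);                    /* first aligned load */
    for (size_t off = 0; off < n; off += CHUNK) {
        memcpy(hi, src_base + off + CHUNK, CHUNK);  /* next aligned load  */
        ext_merge(merged, lo, hi, shift);           /* merge-and-shift    */
        memcpy(dst + off, merged, CHUNK);           /* aligned store      */
        memcpy(lo, hi, CHUNK);                      /* hi becomes next lo */
    }
}
```

Each source chunk is loaded once and reused as the low half of the next
merge, so the loop issues one aligned load and one aligned store per 16
bytes copied.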

Performance gain over the current ThunderX2 (T2) implementation:
     memcpy-large, 65K-32M: +40% to +10%
     memcpy-walk,  128-32M: +20% to +2%
parent bcdb1bfa