aarch64: Optimized implementation of strcpy
Optimize the strcpy implementation by using vector loads and operations
in main loop.Compared to aarch64/strcpy.S, it reduces latency of cases
in bench-strlen by 5%~18% when the length of src is greater than 64
bytes, with gains throughout the benchmark.
Checked on aarch64-linux-gnu.
Reviewed-by:
Wilco Dijkstra <Wilco.Dijkstra@arm.com>
Loading