x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4`
Currently `non_temporal_threshold` is set to roughly `3/4 * sizeof_L3 / ncores_per_socket`. This patch updates that value to roughly `sizeof_L3 / 4`.

The original value (specifically, dividing by `ncores_per_socket`) was chosen to limit the amount of other threads' data a `memcpy`/`memset` could evict.

Dividing by `ncores_per_socket`, however, leads to exceedingly low non-temporal thresholds and to using non-temporal stores in cases where REP MOVSB is multiple times faster.

Furthermore, non-temporal stores are written directly to main memory, so using them at a size much smaller than L3 can place soon-to-be-accessed data much further away than it otherwise would be. In addition, modern machines are able to detect streaming patterns (especially if REP MOVSB is used) and provide LRU hints to the memory subsystem. This in effect caps the total amount of eviction at 1/cache_associativity, far below meaningfully thrashing the entire cache.

As best I can tell, the benchmarks that led to this small threshold were done comparing non-temporal stores versus standard cacheable stores. A better comparison (linked below) is against REP MOVSB which, on the measured systems, is nearly 2x faster than non-temporal stores at the low end of the previous threshold, and within 10% for copies over 100MB (well past even the current threshold). In cases with a low number of threads competing for bandwidth, REP MOVSB is ~2x faster up to `sizeof_L3`.

The divisor of `4` is a somewhat arbitrary value. From benchmarks it seems Skylake and Icelake both prefer a divisor of `2`, but older CPUs such as Broadwell prefer something closer to `8`. This patch is meant to be followed up by another one to make the divisor CPU-specific, but in the meantime (and for easier backporting), this patch settles on `4` as a middle ground.
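To make the size of the change concrete, the two formulas can be sketched as below. This is only an illustration of the arithmetic described in this message, not the code in the patch: the function names `old_threshold` and `new_threshold`, and the `divisor` parameter, are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helpers illustrating the arithmetic in the commit
   message; these are not glibc's actual functions.  */

/* Previous behaviour: roughly 3/4 * sizeof_L3 / ncores_per_socket.  */
static size_t
old_threshold (size_t l3_bytes, unsigned int ncores_per_socket)
{
  return l3_bytes * 3 / 4 / ncores_per_socket;
}

/* New behaviour: roughly sizeof_L3 / divisor.  This patch uses a
   divisor of 4; the parameter is an assumption here, added only to
   show the effect of the 2 and 8 values discussed above.  */
static size_t
new_threshold (size_t l3_bytes, unsigned int divisor)
{
  return l3_bytes / divisor;
}
```

As an assumed example, a socket with a 32 MiB L3 and 16 cores would get `old_threshold` = 1.5 MiB, while `new_threshold` with a divisor of `4` gives 8 MiB. Divisors of `2` and `8` would give 16 MiB and 4 MiB respectively.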
Benchmarks comparing non-temporal stores, REP MOVSB, and cacheable stores were done using:
https://github.com/goldsteinn/memcpy-nt-benchmarks

Sheets results (also available in pdf on the github):
https://docs.google.com/spreadsheets/d/e/2PACX-1vS183r0rW_jRX6tG_E90m9qVuFiMbRIJvi5VAE8yYOvEOIEEc3aSNuEsrFbuXw5c3nGboxMmrupZD7K/pubhtml

Reviewed-by: DJ Delorie <dj@redhat.com>
Reviewed-by: Carlos O'Donell <carlos@redhat.com>