nft_set_pipapo: Add support for 8-bit lookup groups and dynamic switch
While grouping matching bits in groups of four saves memory compared
to the more natural choice of 8-bit words (lookup table size is one
eighth), it comes at a performance cost, as the number of lookup
comparisons is doubled, and those also need bitshifts and masking.

Introduce support for 8-bit lookup groups, together with a mapping
mechanism to dynamically switch, based on defined per-table size
thresholds and hysteresis, between 8-bit and 4-bit groups, as tables
grow and shrink. Empty sets start with 8-bit groups, and per-field
tables are converted to 4-bit groups if they get too big.

An alternative approach would have been to swap per-set lookup
operation functions as needed, but this doesn't allow for different
group sizes in the same set, which looks desirable if some fields
need significantly more matching data compared to others due to
heavier impact of ranges (e.g. a big number of subnets with
relatively simple port specifications).

Allowing different group sizes for the same lookup functions implies
the need for further conditional clauses, whose cost, however,
appears to be negligible in tests.

The matching rate figures below were obtained for x86_64 running
the nft_concat_range.sh "performance" cases, averaged over five
runs, on a single thread of an AMD Epyc 7402 CPU, and for aarch64
on a single thread of a BCM2711 (Raspberry Pi 4 Model B 4GB),
clocked at a stable 2147MHz frequency:

---------------.-----------------------------------.------------.
AMD Epyc 7402  |          baselines, Mpps          | this patch |
 1 thread      |___________________________________|____________|
 3.35GHz       |        |        |        |        |            |
 768KiB L1D$   | netdev |  hash  | rbtree |        |            |
---------------|  hook  |   no   | single | pipapo |   pipapo   |
type   entries |  drop  | ranges | field  | 4 bits | bit switch |
---------------|--------|--------|--------|--------|------------|
net,port       |        |        |        |        |            |
          1000 |   19.0 |   10.4 |    3.8 |    2.8 | 4.0   +43% |
---------------|--------|--------|--------|--------|------------|
port,net       |        |        |        |        |            |
           100 |   18.8 |   10.3 |    5.8 |    5.5 | 6.3   +14% |
---------------|--------|--------|--------|--------|------------|
net6,port      |        |        |        |        |            |
          1000 |   16.4 |    7.6 |    1.8 |    1.3 | 2.1   +61% |
---------------|--------|--------|--------|--------|------------|
port,proto     |        |        |        |        |     [1]    |
         30000 |   19.6 |   11.6 |    3.9 |    0.3 | 0.5   +66% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac  |        |        |        |        |            |
            10 |   16.5 |    5.4 |    4.3 |    2.6 | 3.4   +31% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac, |        |        |        |        |            |
proto     1000 |   16.5 |    5.7 |    1.9 |    1.0 | 1.4   +40% |
---------------|--------|--------|--------|--------|------------|
net,mac        |        |        |        |        |            |
          1000 |   19.0 |    8.4 |    3.9 |    1.7 | 2.5   +47% |
---------------'--------'--------'--------'--------'------------'
[1] Causes switch of lookup table buckets for 'port', not 'proto',
    to 4-bit groups

---------------.-----------------------------------.------------.
BCM2711        |          baselines, Mpps          | this patch |
 1 thread      |___________________________________|____________|
 2147MHz       |        |        |        |        |            |
 32KiB L1D$    | netdev |  hash  | rbtree |        |            |
---------------|  hook  |   no   | single | pipapo |   pipapo   |
type   entries |  drop  | ranges | field  | 4 bits | bit switch |
---------------|--------|--------|--------|--------|------------|
net,port       |        |        |        |        |            |
          1000 |   1.63 |   1.37 |   0.87 |   0.61 | 0.70  +17% |
---------------|--------|--------|--------|--------|------------|
port,net       |        |        |        |        |            |
           100 |   1.64 |   1.36 |   1.02 |   0.78 | 0.81   +4% |
---------------|--------|--------|--------|--------|------------|
net6,port      |        |        |        |        |            |
          1000 |   1.56 |   1.27 |   0.65 |   0.34 | 0.50  +47% |
---------------|--------|--------|--------|--------|------------|
port,proto [2] |        |        |        |        |            |
         10000 |   1.68 |   1.43 |   0.84 |   0.30 | 0.40  +13% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac  |        |        |        |        |            |
            10 |   1.56 |   1.14 |   1.02 |   0.62 | 0.66   +6% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac, |        |        |        |        |            |
proto     1000 |   1.56 |   1.12 |   0.64 |   0.27 | 0.40  +48% |
---------------|--------|--------|--------|--------|------------|
net,mac        |        |        |        |        |            |
          1000 |   1.63 |   1.26 |   0.87 |   0.41 | 0.53  +29% |
---------------'--------'--------'--------'--------'------------'
[2] Using 10000 entries instead of 30000 as it would take way too
    long for the test script to generate all of them

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
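
Illustration (not part of the patch): the trade-off between group
widths and the threshold-based switch described in this message can
be sketched as below. This is a minimal model under stated
assumptions, not the kernel implementation. The names lt_size(),
bucket_index() and adjust_group_bits() are hypothetical, the
threshold values are placeholders, and measuring the size a table
would have with 8-bit groups is a simplification chosen to keep the
hysteresis easy to follow.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder thresholds, chosen for illustration only. */
#define LT_SIZE_HIGH	(2u * 1024 * 1024)
#define LT_SIZE_LOW	(LT_SIZE_HIGH - 64u * 1024)

/* Bytes used by one field's lookup table in this simplified model:
 * one bitmap (one bit per rule) for each bucket of each group.
 */
static size_t lt_size(unsigned int field_bits, unsigned int bb, size_t rules)
{
	size_t groups = field_bits / bb;	/* bb=8: N/8, bb=4: N/4 */
	size_t buckets = (size_t)1 << bb;	/* bb=8: 256, bb=4: 16  */
	size_t bitmap_bytes = (rules + 7) / 8;

	return groups * buckets * bitmap_bytes;
}

/* Bucket selected by a given group of the key. With 8-bit groups the
 * key byte is used directly; with 4-bit groups each byte holds two
 * groups (high nibble first here), so a shift or a mask is needed.
 */
static unsigned int bucket_index(const uint8_t *key, unsigned int group,
				 unsigned int bb)
{
	uint8_t byte;

	if (bb == 8)
		return key[group];

	byte = key[group / 2];
	return (group % 2) ? (byte & 0x0f) : (byte >> 4);
}

/* Pick the group width for a field table: switch to 4-bit groups
 * once the table would exceed the high mark with 8-bit groups, and
 * go back to 8-bit groups only once it would fall below the low
 * mark. Sizes between the two marks keep the current width, which
 * avoids flapping as entries are added and removed.
 */
static unsigned int adjust_group_bits(unsigned int bb,
				      unsigned int field_bits, size_t rules)
{
	size_t size8 = lt_size(field_bits, 8, rules);

	if (bb == 8 && size8 > LT_SIZE_HIGH)
		return 4;
	if (bb == 4 && size8 < LT_SIZE_LOW)
		return 8;

	return bb;
}
```

In this model a 32-bit field with 1000 rules takes 128000 bytes with
8-bit groups and 16000 bytes with 4-bit groups, matching the "one
eighth" ratio above, while the 4-bit variant needs eight bucket
lookups per packet instead of four.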