Commit 4b837ad5 authored by David S. Miller

Merge branch 'netfilter-flowtable'



Pablo Neira Ayuso says:

====================
netfilter: flowtable enhancements

[ This is v2, which includes documentation enhancements covering
  existing limitations. This is a rebase on top of net-next. ]

The following patchset augments the Netfilter flowtable fastpath to
support network topologies that combine IP forwarding, bridge, classic
VLAN devices, bridge VLAN filtering, DSA and PPPoE. This includes
support for the flowtable software and hardware datapaths.

The following picture provides an example scenario:

                        fast path!
                .------------------------.
               /                          \
               |           IP forwarding  |
               |          /             \ \/
               |       br0               wan ..... eth0
               .       / \                         host C
               -> veth1  veth2
                   .           switch/router
                   .
                   .
                 eth0
                host A

The bridge master device 'br0' has an IP address, and a DHCP server is
also assumed to be running to provide connectivity to host A, which
reaches the Internet through 'br0' as its default gateway. Packets then
enter the IP forwarding path, and Netfilter is used to NAT them before
they leave through the wan device.

The general idea is to accelerate forwarding by building a fast path
that takes packets from the ingress path of the bridge port and places
them in the egress path of the wan device (and vice versa), hence
skipping the classic bridge and IP stack paths.
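The forward-path walk that makes this possible can be sketched in plain C. This is a userspace model, not the kernel API: in the kernel the entry point is dev_fill_forward_path() with a per-device .ndo_fill_forward_path callback, and every name below (sim_dev, virt_next_hop, and so on) is hypothetical.

```c
/* Userspace model (not the kernel API) of the forward-path walk behind
 * dev_fill_forward_path(): each stacked virtual device reports its next
 * lower hop until the real ethernet device is reached. All names here
 * are hypothetical stand-ins for the .ndo_fill_forward_path mechanism. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sim_dev {
	const char *name;
	/* stand-in for .ndo_fill_forward_path: returns the next device
	 * towards the wire; a NULL callback marks a real device */
	struct sim_dev *(*next_hop)(struct sim_dev *dev);
	struct sim_dev *lower;	/* e.g. vlan->real_dev or the FDB-selected bridge port */
};

static struct sim_dev *virt_next_hop(struct sim_dev *dev)
{
	return dev->lower;
}

/* Walk the device stack and return the real device at the bottom. */
static struct sim_dev *fill_forward_path(struct sim_dev *dev)
{
	while (dev->next_hop)
		dev = dev->next_hop(dev);
	return dev;
}
```

With a VLAN device stacked on a bridge whose FDB points at one port, the walk resolves the virtual topology down to the physical port that the fastpath can use directly.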

** Patches #1 to #6 add the infrastructure which describes the list of
   netdevice hops to reach a given destination MAC address in the local
   network topology.

Patch #1 adds dev_fill_forward_path() and .ndo_fill_forward_path() to
         netdev_ops.

Patch #2 adds .ndo_fill_forward_path for vlan devices, which provides
         the next device hop via vlan->real_dev, the vlan ID and the
         protocol.

Patch #3 adds .ndo_fill_forward_path for bridge devices, which performs
         FDB lookups to locate the next device hop (bridge port) in the
         forwarding path.

Patch #4 extends bridge .ndo_fill_forward_path to support bridge VLAN
         filtering.

Patch #5 adds .ndo_fill_forward_path for PPPoE devices.

Patch #6 adds .ndo_fill_forward_path for DSA.

Patches #7 to #14 update the flowtable software datapath:

Patch #7 adds the transmit path type field to the flow tuple. Two transmit
         paths are supported so far: the neighbour and the xfrm transmit
         paths.
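As a sketch, the transmit path type is a tag stored in the flow tuple and consulted at transmission time. The enum values below match the names this series introduces (FLOW_OFFLOAD_XMIT_NEIGH, FLOW_OFFLOAD_XMIT_XFRM); the dispatch helper is illustrative only, not kernel code.

```c
/* Sketch of the transmit path type stored in the flow tuple. The enum
 * names match this series; the helper below is an illustrative
 * stand-in for the dispatch done by the flowtable datapath. */
#include <assert.h>
#include <string.h>

enum flow_offload_xmit_type {
	FLOW_OFFLOAD_XMIT_NEIGH,	/* resolve next hop via the neighbour cache */
	FLOW_OFFLOAD_XMIT_XFRM,		/* hand off to the IPsec transform path */
};

static const char *xmit_path_name(enum flow_offload_xmit_type t)
{
	switch (t) {
	case FLOW_OFFLOAD_XMIT_NEIGH:
		return "neigh";
	case FLOW_OFFLOAD_XMIT_XFRM:
		return "xfrm";
	}
	return "unknown";
}
```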

Patch #8 and #9 update the flowtable datapath to use dev_fill_forward_path()
         to obtain the real ingress/egress device for the flowtable datapath.
         This adds the new ethernet xmit direct path to the flowtable.

Patch #10 adds native flowtable VLAN support (up to 2 VLAN tags) through
          dev_fill_forward_path(). The flowtable stores the VLAN id and
          protocol in the flow tuple.
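A minimal sketch of what "up to 2 VLAN tags in the flow tuple" means in practice, assuming an illustrative layout (the struct and field names here are hypothetical, not the exact kernel definitions):

```c
/* Hypothetical sketch of storing up to two layer 2 encapsulations in
 * the flow tuple, as patch #10 describes for VLAN. Field names are
 * illustrative, not the kernel layout. */
#include <assert.h>
#include <stdint.h>

#define FLOW_MAX_ENCAPS 2

struct flow_tuple_encap {
	uint16_t id;	/* VLAN ID (or, for PPPoE, the session ID) */
	uint16_t proto;	/* encapsulation ethertype, e.g. 0x8100 */
};

struct flow_tuple {
	struct flow_tuple_encap encap[FLOW_MAX_ENCAPS];
	uint8_t encap_num;
};

/* Record one encapsulation layer while resolving the forward path;
 * deeper stacking than two layers is rejected. */
static int flow_tuple_add_encap(struct flow_tuple *t, uint16_t id,
				uint16_t proto)
{
	if (t->encap_num >= FLOW_MAX_ENCAPS)
		return -1;
	t->encap[t->encap_num].id = id;
	t->encap[t->encap_num].proto = proto;
	t->encap_num++;
	return 0;
}
```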

Patch #11 adds native flowtable bridge VLAN filter support through
          dev_fill_forward_path().

Patch #12 adds native flowtable PPPoE support through dev_fill_forward_path().

Patch #13 adds DSA support through dev_fill_forward_path().

Patch #14 extends flowtable selftests to cover the flowtable software
          datapath enhancements.

** Patches #15 to #20 update the flowtable hardware offload datapath:

Patch #15 extends the flowtable hardware offload to support the direct
          ethernet xmit path. This also includes VLAN support.

Patch #16 stores the egress real device in the flow tuple. The software
          flowtable datapath uses dev_hard_header() to transmit packets,
          hence it might refer to a VLAN/DSA/PPPoE software device, not
          the real ethernet device.

Patch #17 deals with switchdev PVID hardware offload, skipping it on
          egress.

Patch #18 adds FLOW_ACTION_PPPOE_PUSH to the flow_offload action API.

Patch #19 extends the flowtable hardware offload to support PPPoE.

Patch #20 adds TC_SETUP_FT support for DSA.

** Patches #21 to #23: Felix Fietkau adds a new driver which supports
   hardware offload for the mtk PPE engine through the existing flow
   offload API, which supports the flowtable enhancements coming in
   this batch.

Patch #24 extends the documentation and describes existing limitations.

Please, apply, thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
parents ad248f77 143490cd
Documentation/networking/nf_flowtable.rst: +143 −27
@@ -4,35 +4,38 @@
 Netfilter's flowtable infrastructure
 ====================================
 
-This documentation describes the software flowtable infrastructure available in
-Netfilter since Linux kernel 4.16.
+This documentation describes the Netfilter flowtable infrastructure which allows
+you to define a fastpath through the flowtable datapath. This infrastructure
+also provides hardware offload support. The flowtable supports for the layer 3
+IPv4 and IPv6 and the layer 4 TCP and UDP protocols.
 
 Overview
 --------
 
-Initial packets follow the classic forwarding path, once the flow enters the
-established state according to the conntrack semantics (ie. we have seen traffic
-in both directions), then you can decide to offload the flow to the flowtable
-from the forward chain via the 'flow offload' action available in nftables.
+Once the first packet of the flow successfully goes through the IP forwarding
+path, from the second packet on, you might decide to offload the flow to the
+flowtable through your ruleset. The flowtable infrastructure provides a rule
+action that allows you to specify when to add a flow to the flowtable.
 
-Packets that find an entry in the flowtable (ie. flowtable hit) are sent to the
-output netdevice via neigh_xmit(), hence, they bypass the classic forwarding
-path (the visible effect is that you do not see these packets from any of the
-netfilter hooks coming after the ingress). In case of flowtable miss, the packet
-follows the classic forward path.
+A packet that finds a matching entry in the flowtable (ie. flowtable hit) is
+transmitted to the output netdevice via neigh_xmit(), hence, packets bypass the
+classic IP forwarding path (the visible effect is that you do not see these
+packets from any of the Netfilter hooks coming after ingress). In case that
+there is no matching entry in the flowtable (ie. flowtable miss), the packet
+follows the classic IP forwarding path.
 
-The flowtable uses a resizable hashtable, lookups are based on the following
-7-tuple selectors: source, destination, layer 3 and layer 4 protocols, source
-and destination ports and the input interface (useful in case there are several
-conntrack zones in place).
+The flowtable uses a resizable hashtable. Lookups are based on the following
+n-tuple selectors: layer 2 protocol encapsulation (VLAN and PPPoE), layer 3
+source and destination, layer 4 source and destination ports and the input
+interface (useful in case there are several conntrack zones in place).
 
-Flowtables are populated via the 'flow offload' nftables action, so the user can
-selectively specify what flows are placed into the flow table. Hence, packets
-follow the classic forwarding path unless the user explicitly instruct packets
-to use this new alternative forwarding path via nftables policy.
+The 'flow add' action allows you to populate the flowtable, the user selectively
+specifies what flows are placed into the flowtable. Hence, packets follow the
+classic IP forwarding path unless the user explicitly instruct flows to use this
+new alternative forwarding path via policy.
 
-This is represented in Fig.1, which describes the classic forwarding path
-including the Netfilter hooks and the flowtable fastpath bypass.
+The flowtable datapath is represented in Fig.1, which describes the classic IP
+forwarding path including the Netfilter hooks and the flowtable fastpath bypass.
 
 ::
 
@@ -67,11 +70,13 @@ including the Netfilter hooks and the flowtable fastpath bypass.
 	       Fig.1 Netfilter hooks and flowtable interactions
 
 The flowtable entry also stores the NAT configuration, so all packets are
-mangled according to the NAT policy that matches the initial packets that went
-through the classic forwarding path. The TTL is decremented before calling
-neigh_xmit(). Fragmented traffic is passed up to follow the classic forwarding
-path given that the transport selectors are missing, therefore flowtable lookup
-is not possible.
+mangled according to the NAT policy that is specified from the classic IP
+forwarding path. The TTL is decremented before calling neigh_xmit(). Fragmented
+traffic is passed up to follow the classic IP forwarding path given that the
+transport header is missing, in this case, flowtable lookups are not possible.
+TCP RST and FIN packets are also passed up to the classic IP forwarding path to
+release the flow gracefully. Packets that exceed the MTU are also passed up to
+the classic forwarding path to report packet-too-big ICMP errors to the sender.
 
 Example configuration
 ---------------------
@@ -85,7 +90,7 @@ flowtable and add one rule to your forward chain::
 		}
 		chain y {
 			type filter hook forward priority 0; policy accept;
-			ip protocol tcp flow offload @f
+			ip protocol tcp flow add @f
 			counter packets 0 bytes 0
 		}
 	}
@@ -103,6 +108,117 @@ flow is offloaded, you will observe that the counter rule in the example above
 does not get updated for the packets that are being forwarded through the
 forwarding bypass.
 
+You can identify offloaded flows through the [OFFLOAD] tag when listing your
+connection tracking table.
+
+::
+	# conntrack -L
+	tcp      6 src=10.141.10.2 dst=192.168.10.2 sport=52728 dport=5201 src=192.168.10.2 dst=192.168.10.1 sport=5201 dport=52728 [OFFLOAD] mark=0 use=2
+
+
+Layer 2 encapsulation
+---------------------
+
+Since Linux kernel 5.13, the flowtable infrastructure discovers the real
+netdevice behind VLAN and PPPoE netdevices. The flowtable software datapath
+parses the VLAN and PPPoE layer 2 headers to extract the ethertype and the
+VLAN ID / PPPoE session ID which are used for the flowtable lookups. The
+flowtable datapath also deals with layer 2 decapsulation.
+
+You do not need to add the PPPoE and the VLAN devices to your flowtable,
+instead the real device is sufficient for the flowtable to track your flows.
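As an editor's illustration (not part of the patch): extracting the VLAN ID and inner ethertype from a single 802.1Q tag, which is the kind of parsing the software datapath performs on ingress. The struct and helper names are hypothetical and simplified to one tag with manual byte handling.

```c
/* Illustrative sketch of parsing one 802.1Q tag: the TCI carries the
 * VLAN ID in its low 12 bits, followed by the inner ethertype. Names
 * are hypothetical; bytes are in network order. */
#include <assert.h>
#include <stdint.h>

struct vlan_hdr_sim {
	uint8_t tci_hi, tci_lo;		/* PCP/DEI/VID, network order */
	uint8_t proto_hi, proto_lo;	/* inner ethertype, network order */
};

static uint16_t vlan_id(const struct vlan_hdr_sim *h)
{
	/* mask off PCP/DEI, keep the 12-bit VID */
	return ((uint16_t)(h->tci_hi & 0x0f) << 8) | h->tci_lo;
}

static uint16_t inner_proto(const struct vlan_hdr_sim *h)
{
	return ((uint16_t)h->proto_hi << 8) | h->proto_lo;
}
```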

+Bridge and IP forwarding
+------------------------
+
+Since Linux kernel 5.13, you can add bridge ports to the flowtable. The
+flowtable infrastructure discovers the topology behind the bridge device. This
+allows the flowtable to define a fastpath bypass between the bridge ports
+(represented as eth1 and eth2 in the example figure below) and the gateway
+device (represented as eth0) in your switch/router.
+
+::
+                      fastpath bypass
+               .-------------------------.
+              /                           \
+              |           IP forwarding   |
+              |          /             \ \/
+              |       br0               eth0 ..... eth0
+              .       / \                          *host B*
+               -> eth1  eth2
+                   .           *switch/router*
+                   .
+                   .
+                 eth0
+               *host A*
+
+The flowtable infrastructure also supports for bridge VLAN filtering actions
+such as PVID and untagged. You can also stack a classic VLAN device on top of
+your bridge port.
+
+If you would like that your flowtable defines a fastpath between your bridge
+ports and your IP forwarding path, you have to add your bridge ports (as
+represented by the real netdevice) to your flowtable definition.
+
+Counters
+--------
+
+The flowtable can synchronize packet and byte counters with the existing
+connection tracking entry by specifying the counter statement in your flowtable
+definition, e.g.
+
+::
+	table inet x {
+		flowtable f {
+			hook ingress priority 0; devices = { eth0, eth1 };
+			counter
+		}
+		...
+	}
+
+Counter support is available since Linux kernel 5.7.
+
+Hardware offload
+----------------
+
+If your network device provides hardware offload support, you can turn it on by
+means of the 'offload' flag in your flowtable definition, e.g.
+
+::
+	table inet x {
+		flowtable f {
+			hook ingress priority 0; devices = { eth0, eth1 };
+			flags offload;
+		}
+		...
+	}
+
+There is a workqueue that adds the flows to the hardware. Note that a few
+packets might still run over the flowtable software path until the workqueue has
+a chance to offload the flow to the network device.
+
+You can identify hardware offloaded flows through the [HW_OFFLOAD] tag when
+listing your connection tracking table. Please, note that the [OFFLOAD] tag
+refers to the software offload mode, so there is a distinction between [OFFLOAD]
+which refers to the software flowtable fastpath and [HW_OFFLOAD] which refers
+to the hardware offload datapath being used by the flow.
+
+The flowtable hardware offload infrastructure also supports for the DSA
+(Distributed Switch Architecture).
+
+Limitations
+-----------
+
+The flowtable behaves like a cache. The flowtable entries might get stale if
+either the destination MAC address or the egress netdevice that is used for
+transmission changes.
+
+This might be a problem if:
+
+- You run the flowtable in software mode and you combine bridge and IP
+  forwarding in your setup.
+- Hardware offload is enabled.
 
 More reading
 ------------

drivers/net/ethernet/mediatek/Makefile: +1 −1
@@ -4,5 +4,5 @@
 #
 
 obj-$(CONFIG_NET_MEDIATEK_SOC) += mtk_eth.o
-mtk_eth-y := mtk_eth_soc.o mtk_sgmii.o mtk_eth_path.o
+mtk_eth-y := mtk_eth_soc.o mtk_sgmii.o mtk_eth_path.o mtk_ppe.o mtk_ppe_debugfs.o mtk_ppe_offload.o
 obj-$(CONFIG_NET_MEDIATEK_STAR_EMAC) += mtk_star_emac.o
drivers/net/ethernet/mediatek/mtk_eth_soc.c: +33 −8
@@ -19,6 +19,7 @@
 #include <linux/interrupt.h>
 #include <linux/pinctrl/devinfo.h>
 #include <linux/phylink.h>
+#include <net/dsa.h>
 
 #include "mtk_eth_soc.h"
 
@@ -1264,13 +1265,12 @@ static int mtk_poll_rx(struct napi_struct *napi, int budget,
 			break;
 
 		/* find out which mac the packet come from. values start at 1 */
-		if (MTK_HAS_CAPS(eth->soc->caps, MTK_SOC_MT7628)) {
+		if (MTK_HAS_CAPS(eth->soc->caps, MTK_SOC_MT7628) ||
+		    (trxd.rxd4 & RX_DMA_SPECIAL_TAG))
 			mac = 0;
-		} else {
-			mac = (trxd.rxd4 >> RX_DMA_FPORT_SHIFT) &
-				RX_DMA_FPORT_MASK;
-			mac--;
-		}
+		else
+			mac = ((trxd.rxd4 >> RX_DMA_FPORT_SHIFT) &
+			       RX_DMA_FPORT_MASK) - 1;
 
 		if (unlikely(mac < 0 || mac >= MTK_MAC_COUNT ||
 			     !eth->netdev[mac]))
@@ -2233,6 +2233,9 @@ static void mtk_gdm_config(struct mtk_eth *eth, u32 config)
 
 		val |= config;
 
+		if (!i && eth->netdev[0] && netdev_uses_dsa(eth->netdev[0]))
+			val |= MTK_GDMA_SPECIAL_TAG;
+
 		mtk_w32(eth, val, MTK_GDMA_FWD_CFG(i));
 	}
 	/* Reset and enable PSE */
@@ -2255,12 +2258,17 @@ static int mtk_open(struct net_device *dev)
 
 	/* we run 2 netdevs on the same dma ring so we only bring it up once */
 	if (!refcount_read(&eth->dma_refcnt)) {
-		int err = mtk_start_dma(eth);
+		u32 gdm_config = MTK_GDMA_TO_PDMA;
+		int err;
 
+		err = mtk_start_dma(eth);
 		if (err)
 			return err;
 
-		mtk_gdm_config(eth, MTK_GDMA_TO_PDMA);
+		if (eth->soc->offload_version && mtk_ppe_start(&eth->ppe) == 0)
+			gdm_config = MTK_GDMA_TO_PPE;
+
+		mtk_gdm_config(eth, gdm_config);
 
 		napi_enable(&eth->tx_napi);
 		napi_enable(&eth->rx_napi);
@@ -2327,6 +2335,9 @@ static int mtk_stop(struct net_device *dev)
 
 	mtk_dma_free(eth);
 
+	if (eth->soc->offload_version)
+		mtk_ppe_stop(&eth->ppe);
+
 	return 0;
 }
 
@@ -2832,6 +2843,7 @@ static const struct net_device_ops mtk_netdev_ops = {
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller	= mtk_poll_controller,
 #endif
+	.ndo_setup_tc		= mtk_eth_setup_tc,
 };
 
 static int mtk_add_mac(struct mtk_eth *eth, struct device_node *np)
@@ -3088,6 +3100,17 @@ static int mtk_probe(struct platform_device *pdev)
 			goto err_free_dev;
 	}
 
+	if (eth->soc->offload_version) {
+		err = mtk_ppe_init(&eth->ppe, eth->dev,
+				   eth->base + MTK_ETH_PPE_BASE, 2);
+		if (err)
+			goto err_free_dev;
+
+		err = mtk_eth_offload_init(eth);
+		if (err)
+			goto err_free_dev;
+	}
+
 	for (i = 0; i < MTK_MAX_DEVS; i++) {
 		if (!eth->netdev[i])
 			continue;
@@ -3162,6 +3185,7 @@ static const struct mtk_soc_data mt7621_data = {
 	.hw_features = MTK_HW_FEATURES,
 	.required_clks = MT7621_CLKS_BITMAP,
 	.required_pctl = false,
+	.offload_version = 2,
 };
 
 static const struct mtk_soc_data mt7622_data = {
@@ -3170,6 +3194,7 @@ static const struct mtk_soc_data mt7622_data = {
 	.hw_features = MTK_HW_FEATURES,
 	.required_clks = MT7622_CLKS_BITMAP,
 	.required_pctl = false,
+	.offload_version = 2,
 };
 
 static const struct mtk_soc_data mt7623_data = {
drivers/net/ethernet/mediatek/mtk_eth_soc.h: +22 −1
@@ -15,6 +15,8 @@
 #include <linux/u64_stats_sync.h>
 #include <linux/refcount.h>
 #include <linux/phylink.h>
+#include <linux/rhashtable.h>
+#include "mtk_ppe.h"
 
 #define MTK_QDMA_PAGE_SIZE	2048
 #define MTK_MAX_RX_LENGTH	1536
@@ -40,7 +42,8 @@
 				 NETIF_F_HW_VLAN_CTAG_RX | \
 				 NETIF_F_SG | NETIF_F_TSO | \
 				 NETIF_F_TSO6 | \
-				 NETIF_F_IPV6_CSUM)
+				 NETIF_F_IPV6_CSUM |\
+				 NETIF_F_HW_TC)
 #define MTK_HW_FEATURES_MT7628	(NETIF_F_SG | NETIF_F_RXCSUM)
 #define NEXT_DESP_IDX(X, Y)	(((X) + 1) & ((Y) - 1))
 
@@ -82,10 +85,12 @@
 
 /* GDM Exgress Control Register */
 #define MTK_GDMA_FWD_CFG(x)	(0x500 + (x * 0x1000))
+#define MTK_GDMA_SPECIAL_TAG	BIT(24)
 #define MTK_GDMA_ICS_EN		BIT(22)
 #define MTK_GDMA_TCS_EN		BIT(21)
 #define MTK_GDMA_UCS_EN		BIT(20)
 #define MTK_GDMA_TO_PDMA	0x0
+#define MTK_GDMA_TO_PPE		0x4444
 #define MTK_GDMA_DROP_ALL       0x7777
 
 /* Unicast Filter MAC Address Register - Low */
@@ -300,11 +305,18 @@
 /* QDMA descriptor rxd3 */
 #define RX_DMA_VID(_x)		((_x) & 0xfff)
 
+/* QDMA descriptor rxd4 */
+#define MTK_RXD4_FOE_ENTRY	GENMASK(13, 0)
+#define MTK_RXD4_PPE_CPU_REASON	GENMASK(18, 14)
+#define MTK_RXD4_SRC_PORT	GENMASK(21, 19)
+#define MTK_RXD4_ALG		GENMASK(31, 22)
+
 /* QDMA descriptor rxd4 */
 #define RX_DMA_L4_VALID		BIT(24)
 #define RX_DMA_L4_VALID_PDMA	BIT(30)		/* when PDMA is used */
 #define RX_DMA_FPORT_SHIFT	19
 #define RX_DMA_FPORT_MASK	0x7
+#define RX_DMA_SPECIAL_TAG	BIT(22)
 
 /* PHY Indirect Access Control registers */
 #define MTK_PHY_IAC		0x10004
@@ -802,6 +814,7 @@ struct mtk_soc_data {
 	u32		caps;
 	u32		required_clks;
 	bool		required_pctl;
+	u8		offload_version;
 	netdev_features_t hw_features;
 };
 
@@ -901,6 +914,9 @@ struct mtk_eth {
 	u32				tx_int_status_reg;
 	u32				rx_dma_l4_valid;
 	int				ip_align;
+
+	struct mtk_ppe			ppe;
+	struct rhashtable		flow_table;
 };
 
 /* struct mtk_mac -	the structure that holds the info about the MACs of the
@@ -945,4 +961,9 @@ int mtk_gmac_sgmii_path_setup(struct mtk_eth *eth, int mac_id);
 int mtk_gmac_gephy_path_setup(struct mtk_eth *eth, int mac_id);
 int mtk_gmac_rgmii_path_setup(struct mtk_eth *eth, int mac_id);
 
+int mtk_eth_offload_init(struct mtk_eth *eth);
+int mtk_eth_setup_tc(struct net_device *dev, enum tc_setup_type type,
+		     void *type_data);
+
+
 #endif /* MTK_ETH_H */
New file: +511 −0 (preview collapsed)