Commit 74d23931 authored by Martin KaFai Lau

Merge branch 'xdp: hints via kfuncs'

Stanislav Fomichev says:

====================

Please see the first patch in the series for the overall
design and use-cases.

See the following email from Toke for the per-packet metadata overhead:
https://lore.kernel.org/bpf/20221206024554.3826186-1-sdf@google.com/T/#m49d48ea08d525ec88360c7d14c4d34fb0e45e798

Recent changes:
- Keep new functions in en/xdp.c, do 'extern mlx5_xdp_metadata_ops' (Tariq)

- Remove mxbuf pointer and use xsk_buff_to_mxbuf (Tariq)

- Clarify xdp_buff vs 'XDP frame' (Jesper)

- Explicitly mention that AF_XDP RX descriptor lacks metadata size (Jesper)

- Drop libbpf_flags/xdp_flags from selftests and use ifindex instead
  of ifname (due to recent xsk.h refactoring)

Prior art (to record pros/cons for different approaches):

- Stable UAPI approach:
  https://lore.kernel.org/bpf/20220628194812.1453059-1-alexandr.lobakin@intel.com/
- Metadata+BTF_ID approach:
  https://lore.kernel.org/bpf/166256538687.1434226.15760041133601409770.stgit@firesoul/
- v7:
  https://lore.kernel.org/bpf/20230112003230.3779451-1-sdf@google.com/
- v6:
  https://lore.kernel.org/bpf/20230104215949.529093-1-sdf@google.com/
- v5:
  https://lore.kernel.org/bpf/20221220222043.3348718-1-sdf@google.com/
- v4:
  https://lore.kernel.org/bpf/20221213023605.737383-1-sdf@google.com/
- v3:
  https://lore.kernel.org/bpf/20221206024554.3826186-1-sdf@google.com/
- v2:
  https://lore.kernel.org/bpf/20221121182552.2152891-1-sdf@google.com/
- v1:
  https://lore.kernel.org/bpf/20221115030210.3159213-1-sdf@google.com/
- kfuncs v2 RFC:
  https://lore.kernel.org/bpf/20221027200019.4106375-1-sdf@google.com/
- kfuncs v1 RFC:
  https://lore.kernel.org/bpf/20221104032532.1615099-1-sdf@google.com/



Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org

Stanislav Fomichev (13):
  bpf: Document XDP RX metadata
  bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded
  bpf: Move offload initialization into late_initcall
  bpf: Reshuffle some parts of bpf/offload.c
  bpf: Introduce device-bound XDP programs
  selftests/bpf: Update expected test_offload.py messages
  bpf: XDP metadata RX kfuncs
  veth: Introduce veth_xdp_buff wrapper for xdp_buff
  veth: Support RX XDP metadata
  selftests/bpf: Verify xdp_metadata xdp->af_xdp path
  net/mlx4_en: Introduce wrapper for xdp_buff
  net/mlx4_en: Support RX XDP metadata
  selftests/bpf: Simple program to dump XDP RX metadata
====================

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
parents 84150795 297a3f12
+1 −0
@@ -120,6 +120,7 @@ Contents:
   xfrm_proc
   xfrm_sync
   xfrm_sysctl
+  xdp-rx-metadata

 .. only::  subproject and html

+110 −0
===============
XDP RX Metadata
===============

This document describes how an eXpress Data Path (XDP) program can access
hardware metadata related to a packet using a set of helper functions,
and how it can pass that metadata on to other consumers.

General Design
==============

XDP has access to a set of kfuncs to manipulate the metadata in an XDP frame.
Every device driver that wishes to expose additional packet metadata can
implement these kfuncs. The set of kfuncs is declared in ``include/net/xdp.h``
via ``XDP_METADATA_KFUNC_xxx``.

Currently, the following kfuncs are supported. In the future, as more
metadata is supported, this set will grow:

.. kernel-doc:: net/core/xdp.c
   :identifiers: bpf_xdp_metadata_rx_timestamp bpf_xdp_metadata_rx_hash

An XDP program can use these kfuncs to read the metadata into stack
variables for its own consumption. Or, to pass the metadata on to other
consumers, an XDP program can store it into the metadata area carried
ahead of the packet.

Not all kfuncs have to be implemented by the device driver; when not
implemented, the default ones that return ``-EOPNOTSUPP`` will be used.
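
As a rough illustration of the fallback described above, the following
userspace sketch models a per-device table of callbacks in which an unset
entry behaves like a default that returns ``-EOPNOTSUPP``. Every ``model_*``
and ``drv_*`` name is hypothetical and only mirrors the shape of
``xdp_metadata_ops``; the kernel's own resolution of driver versus default
kfuncs is not shown here, only the behaviour a caller observes.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; these are not the kernel's types. */
struct model_ctx {
	uint64_t hw_timestamp;
	uint32_t hw_hash;
};

struct model_metadata_ops {
	int (*rx_timestamp)(const struct model_ctx *ctx, uint64_t *timestamp);
	int (*rx_hash)(const struct model_ctx *ctx, uint32_t *hash);
};

/* Model driver that implements only the hash callback. */
static int drv_rx_hash(const struct model_ctx *ctx, uint32_t *hash)
{
	*hash = ctx->hw_hash;
	return 0;
}

static const struct model_metadata_ops drv_ops = {
	.rx_hash = drv_rx_hash,
	/* .rx_timestamp intentionally left unset */
};

/* Use the driver callback if present, otherwise report "not supported". */
static int model_rx_timestamp(const struct model_metadata_ops *ops,
			      const struct model_ctx *ctx, uint64_t *timestamp)
{
	if (ops && ops->rx_timestamp)
		return ops->rx_timestamp(ctx, timestamp);
	return -EOPNOTSUPP;
}

static int model_rx_hash(const struct model_metadata_ops *ops,
			 const struct model_ctx *ctx, uint32_t *hash)
{
	if (ops && ops->rx_hash)
		return ops->rx_hash(ctx, hash);
	return -EOPNOTSUPP;
}
```

A caller is expected to check the return value and leave the output untouched
(or store a sentinel) when ``-EOPNOTSUPP`` comes back.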

Within an XDP frame, the metadata layout (accessed via ``xdp_buff``) is
as follows::

  +----------+-----------------+------+
  | headroom | custom metadata | data |
  +----------+-----------------+------+
             ^                 ^
             |                 |
   xdp_buff->data_meta   xdp_buff->data

An XDP program can store individual metadata items into this ``data_meta``
area in whichever format it chooses. Later consumers of the metadata
will have to agree on the format by some out-of-band contract (as in
the AF_XDP use case, see below).
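
The pointer arithmetic behind the diagram can be sketched in plain userspace
C. This is a simplified model, not kernel code: ``model_xdp_buff``,
``model_meta``, ``model_adjust_meta`` and ``model_store_meta`` are
hypothetical names, the metadata format is an arbitrary example, and only
bounds are checked (the real ``bpf_xdp_adjust_meta`` applies further
constraints that are not modelled here).

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the pointers shown in the diagram. */
struct model_xdp_buff {
	uint8_t *data_hard_start;	/* start of headroom */
	uint8_t *data_meta;		/* start of custom metadata */
	uint8_t *data;			/* start of packet data */
};

/* Example application-defined metadata format (an assumption). */
struct model_meta {
	uint64_t rx_timestamp;
	uint32_t rx_hash;
	uint32_t pad;
};

/*
 * Move data_meta by delta bytes. A negative delta grows the metadata
 * area into the headroom; a positive delta shrinks it towards data.
 */
static int model_adjust_meta(struct model_xdp_buff *xdp, int delta)
{
	ptrdiff_t headroom = xdp->data_meta - xdp->data_hard_start;
	ptrdiff_t metalen = xdp->data - xdp->data_meta;

	if (delta < -headroom || delta > metalen)
		return -EINVAL;
	xdp->data_meta += delta;
	return 0;
}

/* Reserve room directly in front of the packet data and fill it. */
static int model_store_meta(struct model_xdp_buff *xdp,
			    const struct model_meta *m)
{
	int err = model_adjust_meta(xdp, -(int)sizeof(*m));

	if (err)
		return err;
	memcpy(xdp->data_meta, m, sizeof(*m));
	return 0;
}
```

After a successful store, ``data - data_meta`` equals the size of the chosen
format, which is exactly the quantity a later consumer has to know in advance.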

AF_XDP
======

The :doc:`af_xdp` use case implies that there is a contract between the BPF
program that redirects XDP frames into the ``AF_XDP`` socket (``XSK``) and
the final consumer. Thus the BPF program manually allocates a fixed number of
bytes out of metadata via ``bpf_xdp_adjust_meta`` and calls a subset
of kfuncs to populate it. The userspace ``XSK`` consumer computes
``xsk_umem__get_data() - METADATA_SIZE`` to locate that metadata.
Note that ``xsk_umem__get_data`` is defined in ``libxdp`` and
``METADATA_SIZE`` is an application-specific constant (the ``AF_XDP`` receive
descriptor does *not* explicitly carry the size of the metadata).

Here is the ``AF_XDP`` consumer layout (note the missing ``data_meta`` pointer)::

  +----------+-----------------+------+
  | headroom | custom metadata | data |
  +----------+-----------------+------+
                               ^
                               |
                        rx_desc->address
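
The consumer-side computation can be sketched as follows. This is a
self-contained userspace model rather than real ``libxdp`` usage:
``model_umem_get_data`` is a hypothetical stand-in for
``xsk_umem__get_data``, ``struct app_meta`` is an assumed
application-specific format, and ``model_read_meta`` is an illustrative
helper.

```c
#include <stdint.h>
#include <string.h>

/* Application-specific metadata format; this layout is an assumption. */
struct app_meta {
	uint64_t rx_timestamp;
	uint32_t rx_hash;
	uint32_t pad;
};

#define METADATA_SIZE sizeof(struct app_meta)

/* Stand-in for xsk_umem__get_data(): packet data at offset addr in the UMEM. */
static void *model_umem_get_data(void *umem_area, uint64_t addr)
{
	return (uint8_t *)umem_area + addr;
}

/*
 * Copy out the metadata that sits immediately before the packet data.
 * The descriptor carries no metadata size, so METADATA_SIZE must match
 * what the BPF program reserved.
 */
static int model_read_meta(void *umem_area, uint64_t addr, struct app_meta *out)
{
	uint8_t *data;

	if (addr < METADATA_SIZE)
		return -1;	/* no room for metadata before the data */
	data = model_umem_get_data(umem_area, addr);
	memcpy(out, data - METADATA_SIZE, METADATA_SIZE);
	return 0;
}
```

Copying with ``memcpy`` instead of casting the pointer avoids relying on the
alignment of the metadata area, which this sketch does not assume.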

XDP_PASS
========

This is the path where the packets processed by the XDP program are passed
into the kernel. The kernel creates the ``skb`` out of the ``xdp_buff``
contents. Currently, every driver has custom kernel code to parse
the descriptors and populate ``skb`` metadata when doing this ``xdp_buff->skb``
conversion, and the XDP metadata is not used by the kernel when building
``skbs``. However, TC-BPF programs can access the XDP metadata area using
the ``data_meta`` pointer.

In the future, we'd like to support a case where an XDP program
can override some of the metadata used for building ``skbs``.

bpf_redirect_map
================

``bpf_redirect_map`` can redirect the frame to a different device.
Some devices (like virtual ethernet links) support running a second XDP
program after the redirect. However, the final consumer doesn't have
access to the original hardware descriptor and can't access any of
the original metadata. The same applies to XDP programs installed
into devmaps and cpumaps.

This means that for redirected packets only custom metadata is
currently supported, which has to be prepared by the initial XDP program
before redirect. If the frame is eventually passed to the kernel, the
``skb`` created from such a frame won't have any hardware metadata
populated. If such a packet is later redirected into an ``XSK``,
the ``XSK`` consumer will also only have access to the custom metadata.

bpf_tail_call
=============

Adding programs that access metadata kfuncs to a ``BPF_MAP_TYPE_PROG_ARRAY``
map is currently not supported.

Example
=======

See ``tools/testing/selftests/bpf/progs/xdp_metadata.c`` and
``tools/testing/selftests/bpf/prog_tests/xdp_metadata.c`` for an example of
a BPF program that handles XDP metadata.
+9 −4
@@ -58,9 +58,7 @@ u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
 	return hi | lo;
 }

-void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
-			    struct skb_shared_hwtstamps *hwts,
-			    u64 timestamp)
+u64 mlx4_en_get_hwtstamp(struct mlx4_en_dev *mdev, u64 timestamp)
 {
 	unsigned int seq;
 	u64 nsec;
@@ -70,8 +68,15 @@ void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
 		nsec = timecounter_cyc2time(&mdev->clock, timestamp);
 	} while (read_seqretry(&mdev->clock_lock, seq));

+	return ns_to_ktime(nsec);
+}
+
+void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
+			    struct skb_shared_hwtstamps *hwts,
+			    u64 timestamp)
+{
 	memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
-	hwts->hwtstamp = ns_to_ktime(nsec);
+	hwts->hwtstamp = mlx4_en_get_hwtstamp(mdev, timestamp);
 }

 /**
+6 −0
@@ -2889,6 +2889,11 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
 	.ndo_bpf		= mlx4_xdp,
 };

+static const struct xdp_metadata_ops mlx4_xdp_metadata_ops = {
+	.xmo_rx_timestamp		= mlx4_en_xdp_rx_timestamp,
+	.xmo_rx_hash			= mlx4_en_xdp_rx_hash,
+};
+
 struct mlx4_en_bond {
 	struct work_struct work;
 	struct mlx4_en_priv *priv;
@@ -3310,6 +3315,7 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 		dev->netdev_ops = &mlx4_netdev_ops_master;
 	else
 		dev->netdev_ops = &mlx4_netdev_ops;
+	dev->xdp_metadata_ops = &mlx4_xdp_metadata_ops;
 	dev->watchdog_timeo = MLX4_EN_WATCHDOG_TIMEOUT;
 	netif_set_real_num_tx_queues(dev, priv->tx_ring_num[TX]);
 	netif_set_real_num_rx_queues(dev, priv->rx_ring_num);
+49 −14
@@ -661,9 +661,41 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
 #define MLX4_CQE_STATUS_IP_ANY (MLX4_CQE_STATUS_IPV4)
 #endif

+struct mlx4_en_xdp_buff {
+	struct xdp_buff xdp;
+	struct mlx4_cqe *cqe;
+	struct mlx4_en_dev *mdev;
+	struct mlx4_en_rx_ring *ring;
+	struct net_device *dev;
+};
+
+int mlx4_en_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp)
+{
+	struct mlx4_en_xdp_buff *_ctx = (void *)ctx;
+
+	if (unlikely(_ctx->ring->hwtstamp_rx_filter != HWTSTAMP_FILTER_ALL))
+		return -EOPNOTSUPP;
+
+	*timestamp = mlx4_en_get_hwtstamp(_ctx->mdev,
+					  mlx4_en_get_cqe_ts(_ctx->cqe));
+	return 0;
+}
+
+int mlx4_en_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash)
+{
+	struct mlx4_en_xdp_buff *_ctx = (void *)ctx;
+
+	if (unlikely(!(_ctx->dev->features & NETIF_F_RXHASH)))
+		return -EOPNOTSUPP;
+
+	*hash = be32_to_cpu(_ctx->cqe->immed_rss_invalid);
+	return 0;
+}
+
 int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct mlx4_en_xdp_buff mxbuf = {};
 	int factor = priv->cqe_factor;
 	struct mlx4_en_rx_ring *ring;
 	struct bpf_prog *xdp_prog;
@@ -671,7 +703,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	bool doorbell_pending;
 	bool xdp_redir_flush;
 	struct mlx4_cqe *cqe;
-	struct xdp_buff xdp;
 	int polled = 0;
 	int index;

@@ -681,7 +712,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	ring = priv->rx_ring[cq_ring];

 	xdp_prog = rcu_dereference_bh(ring->xdp_prog);
-	xdp_init_buff(&xdp, priv->frag_info[0].frag_stride, &ring->xdp_rxq);
+	xdp_init_buff(&mxbuf.xdp, priv->frag_info[0].frag_stride, &ring->xdp_rxq);
 	doorbell_pending = false;
 	xdp_redir_flush = false;

@@ -776,24 +807,28 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 						priv->frag_info[0].frag_size,
 						DMA_FROM_DEVICE);

-			xdp_prepare_buff(&xdp, va - frags[0].page_offset,
-					 frags[0].page_offset, length, false);
-			orig_data = xdp.data;
+			xdp_prepare_buff(&mxbuf.xdp, va - frags[0].page_offset,
+					 frags[0].page_offset, length, true);
+			orig_data = mxbuf.xdp.data;
+			mxbuf.cqe = cqe;
+			mxbuf.mdev = priv->mdev;
+			mxbuf.ring = ring;
+			mxbuf.dev = dev;

-			act = bpf_prog_run_xdp(xdp_prog, &xdp);
+			act = bpf_prog_run_xdp(xdp_prog, &mxbuf.xdp);

-			length = xdp.data_end - xdp.data;
-			if (xdp.data != orig_data) {
-				frags[0].page_offset = xdp.data -
-					xdp.data_hard_start;
-				va = xdp.data;
+			length = mxbuf.xdp.data_end - mxbuf.xdp.data;
+			if (mxbuf.xdp.data != orig_data) {
+				frags[0].page_offset = mxbuf.xdp.data -
+					mxbuf.xdp.data_hard_start;
+				va = mxbuf.xdp.data;
 			}

 			switch (act) {
 			case XDP_PASS:
 				break;
 			case XDP_REDIRECT:
-				if (likely(!xdp_do_redirect(dev, &xdp, xdp_prog))) {
+				if (likely(!xdp_do_redirect(dev, &mxbuf.xdp, xdp_prog))) {
 					ring->xdp_redirect++;
 					xdp_redir_flush = true;
 					frags[0].page = NULL;