Commit de428733 authored by Jakub Kicinski

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-02-11

We've added 96 non-merge commits during the last 14 day(s) which contain
a total of 152 files changed, 4884 insertions(+), 962 deletions(-).

There is a minor conflict in drivers/net/ethernet/intel/ice/ice_main.c
between commit 5b246e53 ("ice: split probe into smaller functions")
from the net-next tree and commit 66c0e13a ("drivers: net: turn on
XDP features") from the bpf-next tree. Remove the hunk given ice_cfg_netdev()
is otherwise there a 2nd time, and add XDP features to the existing
ice_cfg_netdev() one:

        [...]
        ice_set_netdev_features(netdev);
        netdev->xdp_features = NETDEV_XDP_ACT_BASIC | NETDEV_XDP_ACT_REDIRECT |
                               NETDEV_XDP_ACT_XSK_ZEROCOPY;
        ice_set_ops(netdev);
        [...]

Stephen's merge conflict mail:
https://lore.kernel.org/bpf/20230207101951.21a114fa@canb.auug.org.au/

The main changes are:

1) Add support for BPF trampoline on s390x which finally allows to remove many
   test cases from the BPF CI's DENYLIST.s390x, from Ilya Leoshkevich.

2) Add multi-buffer XDP support to ice driver, from Maciej Fijalkowski.

3) Add capability to export the XDP features supported by the NIC.
   Along with that, add a XDP compliance test tool,
   from Lorenzo Bianconi & Marek Majtyka.

4) Add __bpf_kfunc tag for marking kernel functions as kfuncs,
   from David Vernet.

5) Add a deep dive documentation about the verifier's register
   liveness tracking algorithm, from Eduard Zingerman.

6) Fix and follow-up cleanups for resolve_btfids to be compiled
   as a host program to avoid cross compile issues,
   from Jiri Olsa & Ian Rogers.

7) Batch of fixes to the BPF selftest for xdp_hw_metadata which resulted
   when testing on different NICs, from Jesper Dangaard Brouer.

8) Fix libbpf to better detect kernel version code on Debian, from Hao Xiang.

9) Extend libbpf to add an option for when the perf buffer should
   wake up, from Jon Doron.

10) Follow-up fix on xdp_metadata selftest to just consume on TX
    completion, from Stanislav Fomichev.

11) Extend the kfuncs.rst document with description on kfunc
    lifecycle & stability expectations, from David Vernet.

12) Fix bpftool prog profile to skip attaching to offline CPUs,
    from Tonghao Zhang.

====================

Link: https://lore.kernel.org/r/20230211002037.8489-1-daniel@iogearbox.net


Signed-off-by: Jakub Kicinski <kuba@kernel.org>
parents d12f9ad0 17bcd27a
+18 −7
@@ -208,6 +208,10 @@ data structures and compile with kernel internal headers. Both of these
kernel internals are subject to change and can break with newer kernels
such that the program needs to be adapted accordingly.

New BPF functionality is generally added through the use of kfuncs instead of
new helpers. Kfuncs are not considered part of the stable API, and have their own
lifecycle expectations as described in :ref:`BPF_kfunc_lifecycle_expectations`.

Q: Are tracepoints part of the stable ABI?
------------------------------------------
A: NO. Tracepoints are tied to internal implementation details hence they are
@@ -236,8 +240,8 @@ A: NO. Classic BPF programs are converted into extend BPF instructions.

Q: Can BPF call arbitrary kernel functions?
-------------------------------------------
A: NO. BPF programs can only call a set of helper functions which
is defined for every program type.
A: NO. BPF programs can only call specific functions exposed as BPF helpers or
kfuncs. The set of available functions is defined for every program type.

Q: Can BPF overwrite arbitrary kernel memory?
---------------------------------------------
@@ -263,7 +267,12 @@ Q: New functionality via kernel modules?
Q: Can BPF functionality such as new program or map types, new
helpers, etc be added out of kernel module code?

A: NO.
A: Yes, through kfuncs and kptrs

The core BPF functionality such as program types, maps and helpers cannot be
added to by modules. However, modules can expose functionality to BPF programs
by exporting kfuncs (which may return pointers to module-internal data
structures as kptrs).

Q: Directly calling kernel function is an ABI?
----------------------------------------------
@@ -278,7 +287,8 @@ kernel functions have already been used by other kernel tcp
cc (congestion-control) implementations.  If any of these kernel
functions has changed, both the in-tree and out-of-tree kernel tcp cc
implementations have to be changed.  The same goes for the bpf
programs and they have to be adjusted accordingly.
programs and they have to be adjusted accordingly. See
:ref:`BPF_kfunc_lifecycle_expectations` for details.

Q: Attaching to arbitrary kernel functions is an ABI?
-----------------------------------------------------
@@ -340,6 +350,7 @@ compatibility for these features?

A: NO.

Unlike map value types, there are no stability guarantees for this case. The
whole API to work with allocated objects and any support for special fields
inside them is unstable (since it is exposed through kfuncs).
Unlike map value types, the API to work with allocated objects and any support
for special fields inside them is exposed through kfuncs, and thus has the same
lifecycle expectations as the kfuncs themselves. See
:ref:`BPF_kfunc_lifecycle_expectations` for details.
+84 −36
@@ -7,6 +7,11 @@ eBPF Instruction Set Specification, v1.0

This document specifies version 1.0 of the eBPF instruction set.

Documentation conventions
=========================

For brevity, this document uses the type notation "u64", "u32", etc.
to mean an unsigned integer whose width is the specified number of bits.

Registers and calling convention
================================
@@ -30,20 +35,56 @@ Instruction encoding
eBPF has two instruction encodings:

* the basic instruction encoding, which uses 64 bits to encode an instruction
* the wide instruction encoding, which appends a second 64-bit immediate value
  (imm64) after the basic instruction for a total of 128 bits.
* the wide instruction encoding, which appends a second 64-bit immediate (i.e.,
  constant) value after the basic instruction for a total of 128 bits.

The basic instruction encoding looks as follows:
The basic instruction encoding is as follows, where MSB and LSB mean the most significant
bits and least significant bits, respectively:

=============  =======  ===============  ====================  ============
=============  =======  =======  =======  ============
32 bits (MSB)  16 bits  4 bits   4 bits   8 bits (LSB)
=============  =======  ===============  ====================  ============
immediate      offset   source register  destination register  opcode
=============  =======  ===============  ====================  ============
=============  =======  =======  =======  ============
imm            offset   src_reg  dst_reg  opcode
=============  =======  =======  =======  ============

**imm**
  signed integer immediate value

**offset**
  signed integer offset used with pointer arithmetic

**src_reg**
  the source register number (0-10), except where otherwise specified
  (`64-bit immediate instructions`_ reuse this field for other purposes)

**dst_reg**
  destination register number (0-10)

**opcode**
  operation to perform

Note that most instructions do not use all of the fields.
Unused fields shall be cleared to zero.

As discussed below in `64-bit immediate instructions`_, a 64-bit immediate
instruction uses a 64-bit immediate value that is constructed as follows.
The 64 bits following the basic instruction contain a pseudo instruction
using the same format but with opcode, dst_reg, src_reg, and offset all set to zero,
and imm containing the high 32 bits of the immediate value.

=================  ==================
64 bits (MSB)      64 bits (LSB)
=================  ==================
basic instruction  pseudo instruction
=================  ==================

Thus the 64-bit immediate value is constructed as follows:

  imm64 = (next_imm << 32) | imm

where 'next_imm' refers to the imm value of the pseudo instruction
following the basic instruction.
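
As a rough illustration of the formula above, the following C sketch combines
the two 32-bit 'imm' fields into one 64-bit value. The function name
``make_imm64`` is hypothetical, and the sketch assumes that each 'imm' field is
treated as a raw 32-bit pattern (no sign extension) before being combined,
which the formula itself does not spell out.

```c
#include <stdint.h>

/*
 * Hypothetical helper: build the 64-bit immediate from the 'imm' field of
 * the basic instruction (low 32 bits) and the 'imm' field of the pseudo
 * instruction that follows it (high 32 bits).
 *
 * Assumption: both fields are reinterpreted as unsigned 32-bit patterns
 * first, so a negative 'imm' does not smear ones into the upper half.
 */
static uint64_t make_imm64(int32_t imm, int32_t next_imm)
{
	uint64_t low  = (uint64_t)(uint32_t)imm;
	uint64_t high = (uint64_t)(uint32_t)next_imm;

	return (high << 32) | low;
}
```

For instance, under these assumptions ``make_imm64(0x55667788, 0x11223344)``
would yield ``0x1122334455667788``.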

Instruction classes
-------------------

@@ -71,27 +112,32 @@ For arithmetic and jump instructions (``BPF_ALU``, ``BPF_ALU64``, ``BPF_JMP`` an
==============  ======  =================
4 bits (MSB)    1 bit   3 bits (LSB)
==============  ======  =================
operation code  source  instruction class
code            source  instruction class
==============  ======  =================

The 4th bit encodes the source operand:
**code**
  the operation code, whose meaning varies by instruction class

  ======  =====  ========================================
  source  value  description
  ======  =====  ========================================
  BPF_K   0x00   use 32-bit immediate as source operand
  BPF_X   0x08   use 'src_reg' register as source operand
  ======  =====  ========================================
**source**
  the source operand location, which unless otherwise specified is one of:

The four MSB bits store the operation code.
  ======  =====  ==============================================
  source  value  description
  ======  =====  ==============================================
  BPF_K   0x00   use 32-bit 'imm' value as source operand
  BPF_X   0x08   use 'src_reg' register value as source operand
  ======  =====  ==============================================

**instruction class**
  the instruction class (see `Instruction classes`_)
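
The 4-bit / 1-bit / 3-bit split of the opcode byte described above can be
sketched in C as plain mask operations. The struct, macro, and function names
below are hypothetical and chosen only for illustration. Each field is kept in
its original bit position, matching the way values such as ``BPF_X`` (0x08)
are listed in this document.

```c
#include <stdint.h>

/* Masks implied by the layout: 4 bits (MSB), 1 bit, 3 bits (LSB). */
#define OP_CODE_MASK   0xf0u
#define OP_SOURCE_MASK 0x08u
#define OP_CLASS_MASK  0x07u

/* Hypothetical container for the three opcode sub-fields. */
struct alu_jmp_opcode {
	uint8_t code;        /* operation code, e.g. 0xa0 for BPF_XOR */
	uint8_t source;      /* 0x00 (BPF_K) or 0x08 (BPF_X) */
	uint8_t insn_class;  /* instruction class, low 3 bits */
};

/* Split an arithmetic or jump opcode byte into its sub-fields. */
static struct alu_jmp_opcode decode_opcode(uint8_t opcode)
{
	struct alu_jmp_opcode f;

	f.code       = opcode & OP_CODE_MASK;
	f.source     = opcode & OP_SOURCE_MASK;
	f.insn_class = opcode & OP_CLASS_MASK;
	return f;
}
```

Decoding the byte ``0xaf`` this way gives code 0xa0, source 0x08 and class
0x07, which corresponds to ``BPF_XOR | BPF_X | BPF_ALU64`` (the class value
0x07 comes from the instruction class table, which is not part of this hunk).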

Arithmetic instructions
-----------------------

``BPF_ALU`` uses 32-bit wide operands while ``BPF_ALU64`` uses 64-bit wide operands for
otherwise identical operations.
The 'code' field encodes the operation as below:
The 'code' field encodes the operation as below, where 'src' and 'dst' refer
to the values of the source and destination registers, respectively.

========  =====  ==========================================================
code      value  description
@@ -121,19 +167,21 @@ the destination register is unchanged whereas for ``BPF_ALU`` the upper

``BPF_ADD | BPF_X | BPF_ALU`` means::

  dst_reg = (u32) dst_reg + (u32) src_reg;
  dst = (u32) ((u32) dst + (u32) src)

where '(u32)' indicates that the upper 32 bits are zeroed.

``BPF_ADD | BPF_X | BPF_ALU64`` means::

  dst_reg = dst_reg + src_reg
  dst = dst + src

``BPF_XOR | BPF_K | BPF_ALU`` means::

  dst_reg = (u32) dst_reg ^ (u32) imm32
  dst = (u32) dst ^ (u32) imm32

``BPF_XOR | BPF_K | BPF_ALU64`` means::

  dst_reg = dst_reg ^ imm32
  dst = dst ^ imm32
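
To make the difference between the two operand widths concrete, here is a
minimal C model of the ``BPF_ADD`` examples above. The function names are
hypothetical, and registers are modelled as plain ``uint64_t`` values; this is
a sketch of the documented semantics, not code from the kernel.

```c
#include <stdint.h>

/* Model of BPF_ADD | BPF_X | BPF_ALU: 32-bit add, upper 32 bits zeroed. */
static uint64_t alu32_add(uint64_t dst, uint64_t src)
{
	return (uint64_t)(uint32_t)((uint32_t)dst + (uint32_t)src);
}

/* Model of BPF_ADD | BPF_X | BPF_ALU64: full 64-bit add. */
static uint64_t alu64_add(uint64_t dst, uint64_t src)
{
	return dst + src;
}
```

With ``dst = 0x00000001ffffffff`` and ``src = 1``, the 32-bit form wraps the
low half and clears the upper half, giving 0, while the 64-bit form carries
into the upper half and gives ``0x0000000200000000``.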

Also note that the division and modulo operations are unsigned. Thus, for
``BPF_ALU``, 'imm' is first interpreted as an unsigned 32-bit value, whereas
@@ -167,11 +215,11 @@ Examples:

``BPF_ALU | BPF_TO_LE | BPF_END`` with imm = 16 means::

  dst_reg = htole16(dst_reg)
  dst = htole16(dst)

``BPF_ALU | BPF_TO_BE | BPF_END`` with imm = 64 means::

  dst_reg = htobe64(dst_reg)
  dst = htobe64(dst)

Jump instructions
-----------------
@@ -246,15 +294,15 @@ instructions that transfer data between a register and memory.

``BPF_MEM | <size> | BPF_STX`` means::

  *(size *) (dst_reg + off) = src_reg
  *(size *) (dst + offset) = src

``BPF_MEM | <size> | BPF_ST`` means::

  *(size *) (dst_reg + off) = imm32
  *(size *) (dst + offset) = imm32

``BPF_MEM | <size> | BPF_LDX`` means::

  dst_reg = *(size *) (src_reg + off)
  dst = *(size *) (src + offset)

Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW``.

@@ -288,11 +336,11 @@ BPF_XOR 0xa0 atomic xor

``BPF_ATOMIC | BPF_W  | BPF_STX`` with 'imm' = BPF_ADD means::

  *(u32 *)(dst_reg + off16) += src_reg
  *(u32 *)(dst + offset) += src

``BPF_ATOMIC | BPF_DW | BPF_STX`` with 'imm' = BPF_ADD means::

  *(u64 *)(dst_reg + off16) += src_reg
  *(u64 *)(dst + offset) += src

In addition to the simple atomic operations, there also is a modifier and
two complex atomic operations:
@@ -307,16 +355,16 @@ BPF_CMPXCHG 0xf0 | BPF_FETCH atomic compare and exchange

The ``BPF_FETCH`` modifier is optional for simple atomic operations, and
always set for the complex atomic operations.  If the ``BPF_FETCH`` flag
is set, then the operation also overwrites ``src_reg`` with the value that
is set, then the operation also overwrites ``src`` with the value that
was in memory before it was modified.

The ``BPF_XCHG`` operation atomically exchanges ``src_reg`` with the value
addressed by ``dst_reg + off``.
The ``BPF_XCHG`` operation atomically exchanges ``src`` with the value
addressed by ``dst + offset``.

The ``BPF_CMPXCHG`` operation atomically compares the value addressed by
``dst_reg + off`` with ``R0``. If they match, the value addressed by
``dst_reg + off`` is replaced with ``src_reg``. In either case, the
value that was at ``dst_reg + off`` before the operation is zero-extended
``dst + offset`` with ``R0``. If they match, the value addressed by
``dst + offset`` is replaced with ``src``. In either case, the
value that was at ``dst + offset`` before the operation is zero-extended
and loaded back to ``R0``.
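
The ``BPF_CMPXCHG`` behaviour described above can be approximated in user-space
C with C11 atomics. This is only a model for the 32-bit (``BPF_W``) case: the
function name is hypothetical, ``addr`` stands in for ``dst + offset``, and the
sketch assumes that only the low 32 bits of ``R0`` take part in the comparison.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical model of a 32-bit BPF_CMPXCHG.
 *
 * addr: memory location standing in for 'dst + offset'
 * r0:   current value of R0 (assumption: low 32 bits are compared)
 * src:  value stored if the comparison succeeds
 *
 * Returns the new value of R0: the old memory contents, zero-extended.
 */
static uint64_t cmpxchg32_model(_Atomic uint32_t *addr, uint64_t r0,
				uint32_t src)
{
	uint32_t expected = (uint32_t)r0;

	/*
	 * On success, memory becomes 'src' and 'expected' still holds the
	 * old value. On failure, memory is untouched and 'expected' is
	 * overwritten with the value that was found there.
	 */
	atomic_compare_exchange_strong(addr, &expected, src);

	return (uint64_t)expected;
}
```

In both the matching and the non-matching case the return value is the value
that was in memory before the operation, which mirrors the "in either case"
wording above.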

64-bit immediate instructions
@@ -329,7 +377,7 @@ There is currently only one such instruction.

``BPF_LD | BPF_DW | BPF_IMM`` means::

  dst_reg = imm64
  dst = imm64


Legacy BPF Packet access instructions
+137 −8
@@ -13,7 +13,7 @@ BPF Kernel Functions or more commonly known as kfuncs are functions in the Linux
kernel which are exposed for use by BPF programs. Unlike normal BPF helpers,
kfuncs do not have a stable interface and can change from one kernel release to
another. Hence, BPF programs need to be updated in response to changes in the
kernel.
kernel. See :ref:`BPF_kfunc_lifecycle_expectations` for more information.

2. Defining a kfunc
===================
@@ -41,7 +41,7 @@ An example is given below::
        __diag_ignore_all("-Wmissing-prototypes",
                          "Global kfuncs as their definitions will be in BTF");

        struct task_struct *bpf_find_get_task_by_vpid(pid_t nr)
        __bpf_kfunc struct task_struct *bpf_find_get_task_by_vpid(pid_t nr)
        {
                return find_get_task_by_vpid(nr);
        }
@@ -66,7 +66,7 @@ kfunc with a __tag, where tag may be one of the supported annotations.
This annotation is used to indicate a memory and size pair in the argument list.
An example is given below::

        void bpf_memzero(void *mem, int mem__sz)
        __bpf_kfunc void bpf_memzero(void *mem, int mem__sz)
        {
        ...
        }
@@ -86,7 +86,7 @@ safety of the program.

An example is given below::

        void *bpf_obj_new(u32 local_type_id__k, ...)
        __bpf_kfunc void *bpf_obj_new(u32 local_type_id__k, ...)
        {
        ...
        }
@@ -125,6 +125,20 @@ flags on a set of kfuncs as follows::
This set encodes the BTF ID of each kfunc listed above, and encodes the flags
along with it. Of course, it is also allowed to specify no flags.

kfunc definitions should also always be annotated with the ``__bpf_kfunc``
macro. This prevents issues such as the compiler inlining the kfunc if it's a
static kernel function, or the function being elided in an LTO build as it's
not used in the rest of the kernel. Developers should not manually add
annotations to their kfunc to prevent these issues. If an annotation is
required to prevent such an issue with your kfunc, it is a bug and should be
added to the definition of the macro so that other kfuncs are similarly
protected. An example is given below::

        __bpf_kfunc struct task_struct *bpf_get_task_pid(s32 pid)
        {
        ...
        }

2.4.1 KF_ACQUIRE flag
---------------------

@@ -224,6 +238,28 @@ single argument which must be a trusted argument or a MEM_RCU pointer.
The argument may have reference count of 0 and the kfunc must take this
into consideration.

.. _KF_deprecated_flag:

2.4.9 KF_DEPRECATED flag
------------------------

The KF_DEPRECATED flag is used for kfuncs which are scheduled to be
changed or removed in a subsequent kernel release. A kfunc that is
marked with KF_DEPRECATED should also have any relevant information
captured in its kernel doc. Such information typically includes the
kfunc's expected remaining lifespan, a recommendation for new
functionality that can replace it if any is available, and possibly a
rationale for why it is being removed.

Note that while on some occasions, a KF_DEPRECATED kfunc may continue to be
supported and have its KF_DEPRECATED flag removed, it is likely to be far more
difficult to remove a KF_DEPRECATED flag after it's been added than it is to
prevent it from being added in the first place. As described in
:ref:`BPF_kfunc_lifecycle_expectations`, users that rely on specific kfuncs are
encouraged to make their use-cases known as early as possible, and participate
in upstream discussions regarding whether to keep, change, deprecate, or remove
those kfuncs if and when such discussions occur.

2.5 Registering the kfuncs
--------------------------

@@ -290,14 +326,107 @@ In order to accommodate such requirements, the verifier will enforce strict
PTR_TO_BTF_ID type matching if two types have the exact same name, with one
being suffixed with ``___init``.

3. Core kfuncs
.. _BPF_kfunc_lifecycle_expectations:

3. kfunc lifecycle expectations
===============================

kfuncs provide a kernel <-> kernel API, and thus are not bound by any of the
strict stability restrictions associated with kernel <-> user UAPIs. This means
they can be thought of as similar to EXPORT_SYMBOL_GPL, and can therefore be
modified or removed by a maintainer of the subsystem they're defined in when
it's deemed necessary.

Like any other change to the kernel, maintainers will not change or remove a
kfunc without having a reasonable justification.  Whether or not they'll choose
to change a kfunc will ultimately depend on a variety of factors, such as how
widely used the kfunc is, how long the kfunc has been in the kernel, whether an
alternative kfunc exists, what the norm is in terms of stability for the
subsystem in question, and of course what the technical cost is of continuing
to support the kfunc.

There are several implications of this:

a) kfuncs that are widely used or have been in the kernel for a long time will
   be more difficult to justify being changed or removed by a maintainer. In
   other words, kfuncs that are known to have a lot of users and provide
   significant value provide stronger incentives for maintainers to invest the
   time and complexity in supporting them. It is therefore important for
   developers that are using kfuncs in their BPF programs to communicate and
   explain how and why those kfuncs are being used, and to participate in
   discussions regarding those kfuncs when they occur upstream.

b) Unlike regular kernel symbols marked with EXPORT_SYMBOL_GPL, BPF programs
   that call kfuncs are generally not part of the kernel tree. This means that
   refactoring cannot typically change callers in-place when a kfunc changes,
   as is done for e.g. an upstreamed driver being updated in place when a
   kernel symbol is changed.

   Unlike with regular kernel symbols, this is expected behavior for BPF
   symbols, and out-of-tree BPF programs that use kfuncs should be considered
   relevant to discussions and decisions around modifying and removing those
   kfuncs. The BPF community will take an active role in participating in
   upstream discussions when necessary to ensure that the perspectives of such
   users are taken into account.

c) A kfunc will never have any hard stability guarantees. BPF APIs cannot and
   will not ever hard-block a change in the kernel purely for stability
   reasons. That being said, kfuncs are features that are meant to solve
   problems and provide value to users. The decision of whether to change or
   remove a kfunc is a multivariate technical decision that is made on a
   case-by-case basis, and which is informed by data points such as those
   mentioned above. It is expected that a kfunc being removed or changed with
   no warning will not be a common occurrence or take place without sound
   justification, but it is a possibility that must be accepted if one is to
   use kfuncs.

3.1 kfunc deprecation
---------------------

As described above, while sometimes a maintainer may find that a kfunc must be
changed or removed immediately to accommodate some changes in their subsystem,
usually kfuncs will be able to accommodate a longer and more measured
deprecation process. For example, if a new kfunc comes along which provides
superior functionality to an existing kfunc, the existing kfunc may be
deprecated for some period of time to allow users to migrate their BPF programs
to use the new one. Or, if a kfunc has no known users, a decision may be made
to remove the kfunc (without providing an alternative API) after some
deprecation period so as to provide users with a window to notify the kfunc
maintainer if it turns out that the kfunc is actually being used.

It's expected that the common case will be that kfuncs will go through a
deprecation period rather than being changed or removed without warning. As
described in :ref:`KF_deprecated_flag`, the kfunc framework provides the
KF_DEPRECATED flag to kfunc developers to signal to users that a kfunc has been
deprecated. Once a kfunc has been marked with KF_DEPRECATED, the following
procedure is followed for removal:

1. Any relevant information for deprecated kfuncs is documented in the kfunc's
   kernel docs. This documentation will typically include the kfunc's expected
   remaining lifespan, a recommendation for new functionality that can replace
   the usage of the deprecated function (or an explanation as to why no such
   replacement exists), etc.

2. The deprecated kfunc is kept in the kernel for some period of time after it
   was first marked as deprecated. This time period will be chosen on a
   case-by-case basis, and will typically depend on how widespread the use of
   the kfunc is, how long it has been in the kernel, and how hard it is to move
   to alternatives. This deprecation time period is "best effort", and as
   described :ref:`above<BPF_kfunc_lifecycle_expectations>`, circumstances may
   sometimes dictate that the kfunc be removed before the full intended
   deprecation period has elapsed.

3. After the deprecation period the kfunc will be removed. At this point, BPF
   programs calling the kfunc will be rejected by the verifier.

4. Core kfuncs
==============

The BPF subsystem provides a number of "core" kfuncs that are potentially
applicable to a wide variety of different possible use cases and programs.
Those kfuncs are documented here.

3.1 struct task_struct * kfuncs
4.1 struct task_struct * kfuncs
-------------------------------

There are a number of kfuncs that allow ``struct task_struct *`` objects to be
@@ -373,7 +502,7 @@ Here is an example of it being used:
		return 0;
	}

3.2 struct cgroup * kfuncs
4.2 struct cgroup * kfuncs
--------------------------

``struct cgroup *`` objects also have acquire and release functions:
@@ -488,7 +617,7 @@ the verifier. bpf_cgroup_ancestor() can be used as follows:
		return 0;
	}

3.3 struct cpumask * kfuncs
4.3 struct cpumask * kfuncs
---------------------------

BPF provides a set of kfuncs that can be used to query, allocate, mutate, and
+3 −3
@@ -83,8 +83,8 @@ This prevents from accidentally exporting a symbol, that is not supposed
to be a part of ABI what, in turn, improves both libbpf developer- and
user-experiences.

ABI versionning
---------------
ABI versioning
--------------

To make future ABI extensions possible libbpf ABI is versioned.
Versioning is implemented by ``libbpf.map`` version script that is
@@ -148,7 +148,7 @@ API documentation convention
The libbpf API is documented via comments above definitions in
header files. These comments can be rendered by doxygen and sphinx
for well organized html output. This section describes the
convention in which these comments should be formated.
convention in which these comments should be formatted.

Here is an example from btf.h:

+1 −1
@@ -178,7 +178,7 @@ The following code snippet shows how to update an XSKMAP with an XSK entry.

For an example on how create AF_XDP sockets, please see the AF_XDP-example and
AF_XDP-forwarding programs in the `bpf-examples`_ directory in the `libxdp`_ repository.
For a detailed explaination of the AF_XDP interface please see:
For a detailed explanation of the AF_XDP interface please see:

- `libxdp-readme`_.
- `AF_XDP`_ kernel documentation.