Commit c2865b11 authored by Jakub Kicinski

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-04-13

We've added 260 non-merge commits during the last 36 day(s) which contain
a total of 356 files changed, 21786 insertions(+), 11275 deletions(-).

The main changes are:

1) Rework BPF verifier log behavior and implement it as a rotating log
   by default with the option to retain old-style fixed log behavior,
   from Andrii Nakryiko.

2) Add support for using {FOU,GUE} encap with an ipip device operating
   in collect_md mode and add a set of BPF kfuncs for controlling encap
   params, from Christian Ehrig.

3) Allow BPF programs to detect at load time whether a particular kfunc
   exists or not, and also add support for this in light skeleton,
   from Alexei Starovoitov.

4) Optimize hashmap lookups when key size is multiple of 4,
   from Anton Protopopov.

5) Enable RCU semantics for task BPF kptrs and allow referenced kptr
   tasks to be stored in BPF maps, from David Vernet.

6) Add support for stashing local BPF kptr into a map value via
   bpf_kptr_xchg(). This is useful e.g. for rbtree node creation
   for new cgroups, from Dave Marchevsky.

7) Fix BTF handling of is_int_ptr to skip modifiers, working around
   tracing issues where a program cannot be attached, from Feng Zhou.

8) Migrate a big portion of test_verifier unit tests over to
   test_progs -a verifier_* via inline asm to ease {read,debug}ability,
   from Eduard Zingerman.

9) Several updates to the instruction-set.rst documentation
   which is subject to future IETF standardization
   (https://lwn.net/Articles/926882/), from Dave Thaler.

10) Fix the BPF verifier's propagation of 64->32 tnum sub-register known
    bits information in __reg_bound_offset, from Daniel Borkmann.

11) Add skb bitfield compaction work related to BPF with the overall goal
    to make more of the sk_buff bits optional, from Jakub Kicinski.

12) BPF selftest cleanups for build id extraction which stand on their own,
    separate from the upcoming work integrating build id into struct file
    objects, from Jiri Olsa.

13) Add fixes and optimizations for xsk descriptor validation and several
    selftest improvements for xsk sockets, from Kal Conley.

14) Add BPF links for struct_ops and enable switching implementations
    of BPF TCP cong-ctls under a given name by replacing backing
    struct_ops map, from Kui-Feng Lee.

15) Remove a misleading BPF verifier env->bypass_spec_v1 check on variable
    offset stack read as earlier Spectre checks cover this,
    from Luis Gerhorst.

16) Fix issues in copy_from_user_nofault() for BPF and other tracers
    to resemble copy_from_user_nmi() from safety PoV, from Florian Lehner
    and Alexei Starovoitov.

17) Add --json-summary option to test_progs in order for CI tooling to
    ease parsing of test results, from Manu Bretelle.

18) Batch of improvements and refactoring to prep for upcoming
    bpf_local_storage conversion to bpf_mem_cache_{alloc,free} allocator,
    from Martin KaFai Lau.

19) Improve bpftool's visual program dump which produces the control
    flow graph in a DOT format by adding C source inline annotations,
    from Quentin Monnet.

20) Fix attaching fentry/fexit/fmod_ret/lsm to modules by extracting
    the module name from BTF of the target and searching kallsyms of
    the correct module, from Viktor Malik.

21) Improve BPF verifier handling of '<const> <cond> <non_const>' to better
    detect whether branches, in particular jmp32 ones, are taken,
    from Yonghong Song.

22) Allow BPF TCP cong-ctls to write app_limited of struct tcp_sock.
    A built-in cc or one from a kernel module is already able to write
    to app_limited, from Yixin Shen.

Conflicts:

Documentation/bpf/bpf_devel_QA.rst
  b7abcd9c ("bpf, doc: Link to submitting-patches.rst for general patch submission info")
  0f10f647 ("bpf, docs: Use internal linking for link to netdev subsystem doc")
https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/

include/net/ip_tunnels.h
  bc9d003d ("ip_tunnel: Preserve pointer const in ip_tunnel_info_opts")
  ac931d4c ("ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devices")
https://lore.kernel.org/all/20230413161235.4093777-1-broonie@kernel.org/

net/bpf/test_run.c
  e5995bc7 ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption")
  294635a8 ("bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES")
https://lore.kernel.org/all/20230320102619.05b80a98@canb.auug.org.au/
====================

Link: https://lore.kernel.org/r/20230413191525.7295-1-daniel@iogearbox.net


Signed-off-by: Jakub Kicinski <kuba@kernel.org>
parents 800e68c4 8c5c2a48
+12 −8 Documentation/bpf/bpf_devel_QA.rst
@@ -128,7 +128,8 @@ into the bpf-next tree will make their way into net-next tree. net and
net-next are both run by David S. Miller. From there, they will go
into the kernel mainline tree run by Linus Torvalds. To read up on the
process of net and net-next being merged into the mainline tree, see
-the `netdev-FAQ`_.
+the documentation on netdev subsystem at
+Documentation/process/maintainer-netdev.rst.



@@ -147,7 +148,8 @@ request)::
Q: How do I indicate which tree (bpf vs. bpf-next) my patch should be applied to?
---------------------------------------------------------------------------------

-A: The process is the very same as described in the `netdev-FAQ`_,
+A: The process is the very same as described in the netdev subsystem
+documentation at Documentation/process/maintainer-netdev.rst,
so please read up on it. The subject line must indicate whether the
patch is a fix or rather "next-like" content in order to let the
maintainers know whether it is targeted at bpf or bpf-next.
@@ -206,8 +208,9 @@ ii) run extensive BPF test suite and
Once the BPF pull request was accepted by David S. Miller, then
the patches end up in net or net-next tree, respectively, and
make their way from there further into mainline. Again, see the
-`netdev-FAQ`_ for additional information e.g. on how often they are
-merged to mainline.
+documentation for netdev subsystem at
+Documentation/process/maintainer-netdev.rst for additional information
+e.g. on how often they are merged to mainline.

Q: How long do I need to wait for feedback on my BPF patches?
-------------------------------------------------------------
@@ -230,7 +233,8 @@ Q: Are patches applied to bpf-next when the merge window is open?
-----------------------------------------------------------------
A: For the time when the merge window is open, bpf-next will not be
processed. This is roughly analogous to net-next patch processing,
-so feel free to read up on the `netdev-FAQ`_ about further details.
+so feel free to read up on the netdev docs at
+Documentation/process/maintainer-netdev.rst about further details.

During those two weeks of merge window, we might ask you to resend
your patch series once bpf-next is open again. Once Linus released
@@ -394,7 +398,8 @@ netdev kernel mailing list in Cc and ask for the fix to be queued up:
  netdev@vger.kernel.org

The process in general is the same as on netdev itself, see also the
-`netdev-FAQ`_.
+the documentation on networking subsystem at
+Documentation/process/maintainer-netdev.rst.

Q: Do you also backport to kernels not currently maintained as stable?
----------------------------------------------------------------------
@@ -410,7 +415,7 @@ Q: The BPF patch I am about to submit needs to go to stable as well
What should I do?

A: The same rules apply as with netdev patch submissions in general, see
-the `netdev-FAQ`_.
+the netdev docs at Documentation/process/maintainer-netdev.rst.

Never add "``Cc: stable@vger.kernel.org``" to the patch description, but
ask the BPF maintainers to queue the patches instead. This can be done
@@ -684,7 +689,6 @@ when:


.. Links
-.. _netdev-FAQ: https://www.kernel.org/doc/html/latest/process/maintainer-netdev.html
.. _selftests:
   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/

+6 −0 Documentation/bpf/clang-notes.rst
@@ -20,6 +20,12 @@ Arithmetic instructions
For CPU versions prior to 3, Clang v7.0 and later can enable ``BPF_ALU`` support with
``-Xclang -target-feature -Xclang +alu32``.  In CPU version 3, support is automatically included.

+Jump instructions
+=================
+
+If ``-O0`` is used, Clang will generate the ``BPF_CALL | BPF_X | BPF_JMP`` (0x8d)
+instruction, which is not supported by the Linux kernel verifier.
+
Atomic operations
=================

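
For illustration, here is a minimal sketch of the Clang usage referenced
above; the file name and the -O2/-mcpu options are ordinary clang usage
supplied for context, while the +alu32 feature flags are the ones quoted
in the note itself::

	/* alu32_demo.bpf.c -- 32-bit arithmetic that can be compiled to
	 * 32-bit BPF_ALU instructions on the sub-registers (w0..w10)
	 * instead of BPF_ALU64 plus explicit masking:
	 *
	 *   clang -O2 -target bpf -mcpu=v2 \
	 *         -Xclang -target-feature -Xclang +alu32 \
	 *         -c alu32_demo.bpf.c -o alu32_demo.bpf.o
	 *
	 * With -mcpu=v3 the 32-bit ALU support is included automatically,
	 * while building with -O0 risks emitting the unsupported
	 * BPF_CALL | BPF_X | BPF_JMP (0x8d) instruction noted above.
	 */
	unsigned int add32(unsigned int a, unsigned int b)
	{
		return a + b;
	}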
+10 −20 Documentation/bpf/cpumasks.rst
@@ -117,12 +117,7 @@ For example:
As mentioned and illustrated above, these ``struct bpf_cpumask *`` objects can
also be stored in a map and used as kptrs. If a ``struct bpf_cpumask *`` is in
a map, the reference can be removed from the map with bpf_kptr_xchg(), or
-opportunistically acquired with bpf_cpumask_kptr_get():
-
-.. kernel-doc:: kernel/bpf/cpumask.c
-  :identifiers: bpf_cpumask_kptr_get
-
-Here is an example of a ``struct bpf_cpumask *`` being retrieved from a map:
+opportunistically acquired using RCU:

.. code-block:: c

@@ -144,7 +139,7 @@ Here is an example of a ``struct bpf_cpumask *`` being retrieved from a map:
	/**
	 * A simple example tracepoint program showing how a
	 * struct bpf_cpumask * kptr that is stored in a map can
-	 * be acquired using the bpf_cpumask_kptr_get() kfunc.
+	 * be passed to kfuncs using RCU protection.
	 */
	SEC("tp_btf/cgroup_mkdir")
	int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path)
@@ -158,26 +153,21 @@ Here is an example of a ``struct bpf_cpumask *`` being retrieved from a map:
		if (!v)
			return -ENOENT;

+		bpf_rcu_read_lock();
		/* Acquire a reference to the bpf_cpumask * kptr that's already stored in the map. */
-		kptr = bpf_cpumask_kptr_get(&v->cpumask);
-		if (!kptr)
+		kptr = v->cpumask;
+		if (!kptr) {
			/* If no bpf_cpumask was present in the map, it's because
			 * we're racing with another CPU that removed it with
			 * bpf_kptr_xchg() between the bpf_map_lookup_elem()
-			 * above, and our call to bpf_cpumask_kptr_get().
-			 * bpf_cpumask_kptr_get() internally safely handles this
-			 * race, and will return NULL if the cpumask is no longer
-			 * present in the map by the time we invoke the kfunc.
+			 * above, and our load of the pointer from the map.
			 */
+			bpf_rcu_read_unlock();
			return -EBUSY;
+		}

-		/* Free the reference we just took above. Note that the
-		 * original struct bpf_cpumask * kptr is still in the map. It will
-		 * be freed either at a later time if another context deletes
-		 * it from the map, or automatically by the BPF subsystem if
-		 * it's still present when the map is destroyed.
-		 */
-		bpf_cpumask_release(kptr);
+		bpf_cpumask_setall(kptr);
+		bpf_rcu_read_unlock();

		return 0;
	}
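
To show the whole RCU-based pattern in context, here is a self-contained
sketch assembled around the hunk above; the map layout mirrors the one used
earlier in the cpumasks document, and the map, struct, and program names
here are illustrative rather than taken from the patch::

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	/* kfuncs used below, declared in the usual extern __ksym style. */
	void bpf_rcu_read_lock(void) __ksym;
	void bpf_rcu_read_unlock(void) __ksym;
	void bpf_cpumask_setall(struct bpf_cpumask *cpumask) __ksym;

	struct cpumasks_kfunc_map_value {
		struct bpf_cpumask __kptr * cpumask;
	};

	struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__type(key, int);
		__type(value, struct cpumasks_kfunc_map_value);
		__uint(max_entries, 1);
	} cpumasks_kfunc_map SEC(".maps");

	SEC("tp_btf/cgroup_mkdir")
	int BPF_PROG(cgrp_rcu_example, struct cgroup *cgrp, const char *path)
	{
		struct cpumasks_kfunc_map_value *v;
		struct bpf_cpumask *kptr;
		int key = 0;

		v = bpf_map_lookup_elem(&cpumasks_kfunc_map, &key);
		if (!v)
			return -ENOENT;

		/* No reference is taken; the pointer is only valid inside
		 * the RCU read-side critical section.
		 */
		bpf_rcu_read_lock();
		kptr = v->cpumask;
		if (!kptr) {
			/* Racing with a bpf_kptr_xchg() that removed it. */
			bpf_rcu_read_unlock();
			return -EBUSY;
		}

		bpf_cpumask_setall(kptr);
		bpf_rcu_read_unlock();

		return 0;
	}

	char _license[] SEC("license") = "GPL";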
+101 −28 Documentation/bpf/instruction-set.rst
@@ -11,7 +11,8 @@ Documentation conventions
=========================

For brevity, this document uses the type notion "u64", "u32", etc.
-to mean an unsigned integer whose width is the specified number of bits.
+to mean an unsigned integer whose width is the specified number of bits,
+and "s32", etc. to mean a signed integer of the specified number of bits.

Registers and calling convention
================================
@@ -242,28 +243,58 @@ Jump instructions
otherwise identical operations.
The 'code' field encodes the operation as below:

-========  =====  =========================  ============
-code      value  description                notes
-========  =====  =========================  ============
-BPF_JA    0x00   PC += off                  BPF_JMP only
-BPF_JEQ   0x10   PC += off if dst == src
-BPF_JGT   0x20   PC += off if dst > src     unsigned
-BPF_JGE   0x30   PC += off if dst >= src    unsigned
-BPF_JSET  0x40   PC += off if dst & src
-BPF_JNE   0x50   PC += off if dst != src
-BPF_JSGT  0x60   PC += off if dst > src     signed
-BPF_JSGE  0x70   PC += off if dst >= src    signed
-BPF_CALL  0x80   function call
-BPF_EXIT  0x90   function / program return  BPF_JMP only
-BPF_JLT   0xa0   PC += off if dst < src     unsigned
-BPF_JLE   0xb0   PC += off if dst <= src    unsigned
-BPF_JSLT  0xc0   PC += off if dst < src     signed
-BPF_JSLE  0xd0   PC += off if dst <= src    signed
-========  =====  =========================  ============
+========  =====  ===  ===========================================  =========================================
+code      value  src  description                                  notes
+========  =====  ===  ===========================================  =========================================
+BPF_JA    0x0    0x0  PC += offset                                 BPF_JMP only
+BPF_JEQ   0x1    any  PC += offset if dst == src
+BPF_JGT   0x2    any  PC += offset if dst > src                    unsigned
+BPF_JGE   0x3    any  PC += offset if dst >= src                   unsigned
+BPF_JSET  0x4    any  PC += offset if dst & src
+BPF_JNE   0x5    any  PC += offset if dst != src
+BPF_JSGT  0x6    any  PC += offset if dst > src                    signed
+BPF_JSGE  0x7    any  PC += offset if dst >= src                   signed
+BPF_CALL  0x8    0x0  call helper function by address              see `Helper functions`_
+BPF_CALL  0x8    0x1  call PC += offset                            see `Program-local functions`_
+BPF_CALL  0x8    0x2  call helper function by BTF ID               see `Helper functions`_
+BPF_EXIT  0x9    0x0  return                                       BPF_JMP only
+BPF_JLT   0xa    any  PC += offset if dst < src                    unsigned
+BPF_JLE   0xb    any  PC += offset if dst <= src                   unsigned
+BPF_JSLT  0xc    any  PC += offset if dst < src                    signed
+BPF_JSLE  0xd    any  PC += offset if dst <= src                   signed
+========  =====  ===  ===========================================  =========================================

The eBPF program needs to store the return value into register R0 before doing a
-BPF_EXIT.
+``BPF_EXIT``.

+Example:
+
+``BPF_JSGE | BPF_X | BPF_JMP32`` (0x7e) means::
+
+  if (s32)dst s>= (s32)src goto +offset
+
+where 's>=' indicates a signed '>=' comparison.
+
+Helper functions
+~~~~~~~~~~~~~~~~
+
+Helper functions are a concept whereby BPF programs can call into a
+set of function calls exposed by the underlying platform.
+
+Historically, each helper function was identified by an address
+encoded in the imm field.  The available helper functions may differ
+for each program type, but address values are unique across all program types.
+
+Platforms that support the BPF Type Format (BTF) support identifying
+a helper function by a BTF ID encoded in the imm field, where the BTF ID
+identifies the helper name and type.
+
+Program-local functions
+~~~~~~~~~~~~~~~~~~~~~~~
+Program-local functions are functions exposed by the same BPF program as the
+caller, and are referenced by offset from the call instruction, similar to
+``BPF_JA``.  A ``BPF_EXIT`` within the program-local function will return to
+the caller.

Load and store instructions
===========================
@@ -385,14 +416,56 @@ and loaded back to ``R0``.
-----------------------------

Instructions with the ``BPF_IMM`` 'mode' modifier use the wide instruction
-encoding for an extra imm64 value.
-
-There is currently only one such instruction.
-
-``BPF_LD | BPF_DW | BPF_IMM`` means::
-
-  dst = imm64
-
+encoding defined in `Instruction encoding`_, and use the 'src' field of the
+basic instruction to hold an opcode subtype.
+
+The following table defines a set of ``BPF_IMM | BPF_DW | BPF_LD`` instructions
+with opcode subtypes in the 'src' field, using new terms such as "map"
+defined further below:
+
+=========================  ======  ===  =========================================  ===========  ==============
+opcode construction        opcode  src  pseudocode                                 imm type     dst type
+=========================  ======  ===  =========================================  ===========  ==============
+BPF_IMM | BPF_DW | BPF_LD  0x18    0x0  dst = imm64                                integer      integer
+BPF_IMM | BPF_DW | BPF_LD  0x18    0x1  dst = map_by_fd(imm)                       map fd       map
+BPF_IMM | BPF_DW | BPF_LD  0x18    0x2  dst = map_val(map_by_fd(imm)) + next_imm   map fd       data pointer
+BPF_IMM | BPF_DW | BPF_LD  0x18    0x3  dst = var_addr(imm)                        variable id  data pointer
+BPF_IMM | BPF_DW | BPF_LD  0x18    0x4  dst = code_addr(imm)                       integer      code pointer
+BPF_IMM | BPF_DW | BPF_LD  0x18    0x5  dst = map_by_idx(imm)                      map index    map
+BPF_IMM | BPF_DW | BPF_LD  0x18    0x6  dst = map_val(map_by_idx(imm)) + next_imm  map index    data pointer
+=========================  ======  ===  =========================================  ===========  ==============
+
+where
+
+* map_by_fd(imm) means to convert a 32-bit file descriptor into an address of a map (see `Maps`_)
+* map_by_idx(imm) means to convert a 32-bit index into an address of a map
+* map_val(map) gets the address of the first value in a given map
+* var_addr(imm) gets the address of a platform variable (see `Platform Variables`_) with a given id
+* code_addr(imm) gets the address of the instruction at a specified relative offset in number of (64-bit) instructions
+* the 'imm type' can be used by disassemblers for display
+* the 'dst type' can be used for verification and JIT compilation purposes
+
+Maps
+~~~~
+
+Maps are shared memory regions accessible by eBPF programs on some platforms.
+A map can have various semantics as defined in a separate document, and may or
+may not have a single contiguous memory region, but the 'map_val(map)' is
+currently only defined for maps that do have a single contiguous memory region.
+
+Each map can have a file descriptor (fd) if supported by the platform, where
+'map_by_fd(imm)' means to get the map with the specified file descriptor. Each
+BPF program can also be defined to use a set of maps associated with the
+program at load time, and 'map_by_idx(imm)' means to get the map with the given
+index in the set associated with the BPF program containing the instruction.
+
+Platform Variables
+~~~~~~~~~~~~~~~~~~
+
+Platform variables are memory regions, identified by integer ids, exposed by
+the runtime and accessible by BPF programs on some platforms.  The
+'var_addr(imm)' operation means to get the address of the memory region
+identified by the given id.

Legacy BPF Packet access instructions
-------------------------------------
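
To make the new wide-encoding table concrete, here is a minimal sketch of
the two-slot layout using the uapi ``struct bpf_insn``; the register and
immediate values below are arbitrary::

	#include <linux/bpf.h>

	/* dst = imm64 (opcode 0x18, src 0x0) occupies two consecutive
	 * instruction slots: 'imm' of the first slot holds the low 32 bits
	 * and 'imm' of the following slot ("next_imm") the high 32 bits.
	 */
	const struct bpf_insn ld_imm64[2] = {
		{ .code    = BPF_LD | BPF_DW | BPF_IMM,	/* 0x18 */
		  .dst_reg = BPF_REG_1,
		  .src_reg = 0x0,			/* subtype: dst = imm64 */
		  .imm     = (__s32)0x89abcdef },	/* low 32 bits */
		{ .code    = 0,				/* continuation slot */
		  .imm     = 0x01234567 },		/* high 32 bits */
	};

	/* Loads r1 = 0x0123456789abcdef. With .src_reg = 0x1 and a map fd
	 * in 'imm', the same pair would instead mean dst = map_by_fd(imm).
	 */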
+47 −77 Documentation/bpf/kfuncs.rst
@@ -179,9 +179,10 @@ both are orthogonal to each other.
---------------------

The KF_RELEASE flag is used to indicate that the kfunc releases the pointer
-passed in to it. There can be only one referenced pointer that can be passed in.
-All copies of the pointer being released are invalidated as a result of invoking
-kfunc with this flag.
+passed in to it. There can be only one referenced pointer that can be passed
+in. All copies of the pointer being released are invalidated as a result of
+invoking kfunc with this flag. KF_RELEASE kfuncs automatically receive the
+protection afforded by the KF_TRUSTED_ARGS flag described below.

2.4.4 KF_KPTR_GET flag
----------------------
@@ -470,7 +471,7 @@ struct_ops callback arg. For example:
		struct task_struct *acquired;

		acquired = bpf_task_acquire(task);
-
+		if (acquired)
			/*
			 * In a typical program you'd do something like store
			 * the task in a map, and the map will automatically
@@ -480,6 +481,43 @@ struct_ops callback arg. For example:
		return 0;
	}


+References acquired on ``struct task_struct *`` objects are RCU protected.
+Therefore, when in an RCU read region, you can obtain a pointer to a task
+embedded in a map value without having to acquire a reference:
+
+.. code-block:: c
+
+	#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+	private(TASK) static struct task_struct *global;
+
+	/**
+	 * A trivial example showing how to access a task stored
+	 * in a map using RCU.
+	 */
+	SEC("tp_btf/task_newtask")
+	int BPF_PROG(task_rcu_read_example, struct task_struct *task, u64 clone_flags)
+	{
+		struct task_struct *local_copy;
+
+		bpf_rcu_read_lock();
+		local_copy = global;
+		if (local_copy)
+			/*
+			 * We could also pass local_copy to kfuncs or helper functions here,
+			 * as we're guaranteed that local_copy will be valid until we exit
+			 * the RCU read region below.
+			 */
+			bpf_printk("Global task %s is valid", local_copy->comm);
+		else
+			bpf_printk("No global task found");
+		bpf_rcu_read_unlock();
+
+		/* At this point we can no longer reference local_copy. */
+
+		return 0;
+	}

----

A BPF program can also look up a task from a pid. This can be useful if the
@@ -534,74 +572,6 @@ bpf_task_release() respectively, so we won't provide examples for them.

----

-You may also acquire a reference to a ``struct cgroup`` kptr that's already
-stored in a map using bpf_cgroup_kptr_get():
-
-.. kernel-doc:: kernel/bpf/helpers.c
-   :identifiers: bpf_cgroup_kptr_get
-
-Here's an example of how it can be used:
-
-.. code-block:: c
-
-	/* struct containing the struct task_struct kptr which is actually stored in the map. */
-	struct __cgroups_kfunc_map_value {
-		struct cgroup __kptr * cgroup;
-	};
-
-	/* The map containing struct __cgroups_kfunc_map_value entries. */
-	struct {
-		__uint(type, BPF_MAP_TYPE_HASH);
-		__type(key, int);
-		__type(value, struct __cgroups_kfunc_map_value);
-		__uint(max_entries, 1);
-	} __cgroups_kfunc_map SEC(".maps");
-
-	/* ... */
-
-	/**
-	 * A simple example tracepoint program showing how a
-	 * struct cgroup kptr that is stored in a map can
-	 * be acquired using the bpf_cgroup_kptr_get() kfunc.
-	 */
-	 SEC("tp_btf/cgroup_mkdir")
-	 int BPF_PROG(cgroup_kptr_get_example, struct cgroup *cgrp, const char *path)
-	 {
-		struct cgroup *kptr;
-		struct __cgroups_kfunc_map_value *v;
-		s32 id = cgrp->self.id;
-
-		/* Assume a cgroup kptr was previously stored in the map. */
-		v = bpf_map_lookup_elem(&__cgroups_kfunc_map, &id);
-		if (!v)
-			return -ENOENT;
-
-		/* Acquire a reference to the cgroup kptr that's already stored in the map. */
-		kptr = bpf_cgroup_kptr_get(&v->cgroup);
-		if (!kptr)
-			/* If no cgroup was present in the map, it's because
-			 * we're racing with another CPU that removed it with
-			 * bpf_kptr_xchg() between the bpf_map_lookup_elem()
-			 * above, and our call to bpf_cgroup_kptr_get().
-			 * bpf_cgroup_kptr_get() internally safely handles this
-			 * race, and will return NULL if the task is no longer
-			 * present in the map by the time we invoke the kfunc.
-			 */
-			return -EBUSY;
-
-		/* Free the reference we just took above. Note that the
-		 * original struct cgroup kptr is still in the map. It will
-		 * be freed either at a later time if another context deletes
-		 * it from the map, or automatically by the BPF subsystem if
-		 * it's still present when the map is destroyed.
-		 */
-		bpf_cgroup_release(kptr);
-
-		return 0;
-        }

----

Other kfuncs available for interacting with ``struct cgroup *`` objects are
bpf_cgroup_ancestor() and bpf_cgroup_from_id(), allowing callers to access
the ancestor of a cgroup and find a cgroup by its ID, respectively. Both