Commit c2865b11 authored by Jakub Kicinski

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-04-13

We've added 260 non-merge commits during the last 36 day(s) which contain
a total of 356 files changed, 21786 insertions(+), 11275 deletions(-).

The main changes are:

1) Rework BPF verifier log behavior and implement it as a rotating log
   by default with the option to retain old-style fixed log behavior,
   from Andrii Nakryiko.

2) Add support for using {FOU,GUE} encap with an ipip device operating
   in collect_md mode and add a set of BPF kfuncs for controlling encap
   params, from Christian Ehrig.

3) Allow BPF programs to detect at load time whether a particular kfunc
   exists or not, and also add support for this in light skeleton,
   from Alexei Starovoitov.

4) Optimize hashmap lookups when key size is multiple of 4,
   from Anton Protopopov.

5) Enable RCU semantics for task BPF kptrs and allow referenced kptr
   tasks to be stored in BPF maps, from David Vernet.

6) Add support for stashing local BPF kptr into a map value via
   bpf_kptr_xchg(). This is useful e.g. for rbtree node creation
   for new cgroups, from Dave Marchevsky.

7) Fix BTF handling of is_int_ptr to skip modifiers, working around
   tracing issues where a program cannot be attached, from Feng Zhou.

8) Migrate a big portion of test_verifier unit tests over to
   test_progs -a verifier_* via inline asm to ease {read,debug}ability,
   from Eduard Zingerman.

9) Several updates to the instruction-set.rst documentation
   which is subject to future IETF standardization
   (https://lwn.net/Articles/926882/), from Dave Thaler.

10) Fix the BPF verifier's propagation of 64->32 tnum sub-register known
    bits information in __reg_bound_offset, from Daniel Borkmann.

11) Add skb bitfield compaction work related to BPF with the overall goal
    to make more of the sk_buff bits optional, from Jakub Kicinski.

12) BPF selftest cleanups for build id extraction which stand on their own,
    separate from the upcoming work integrating build id into struct file
    objects, from Jiri Olsa.

13) Add fixes and optimizations for xsk descriptor validation and several
    selftest improvements for xsk sockets, from Kal Conley.

14) Add BPF links for struct_ops and enable switching implementations
    of BPF TCP cong-ctls under a given name by replacing backing
    struct_ops map, from Kui-Feng Lee.

15) Remove a misleading BPF verifier env->bypass_spec_v1 check on variable
    offset stack read as earlier Spectre checks cover this,
    from Luis Gerhorst.

16) Fix issues in copy_from_user_nofault() for BPF and other tracers
    to resemble copy_from_user_nmi() from safety PoV, from Florian Lehner
    and Alexei Starovoitov.

17) Add --json-summary option to test_progs in order for CI tooling to
    ease parsing of test results, from Manu Bretelle.

18) Batch of improvements and refactoring to prep for upcoming
    bpf_local_storage conversion to bpf_mem_cache_{alloc,free} allocator,
    from Martin KaFai Lau.

19) Improve bpftool's visual program dump which produces the control
    flow graph in a DOT format by adding C source inline annotations,
    from Quentin Monnet.

20) Fix attaching fentry/fexit/fmod_ret/lsm to modules by extracting
    the module name from BTF of the target and searching kallsyms of
    the correct module, from Viktor Malik.

21) Improve BPF verifier handling of '<const> <cond> <non_const>' to better
    detect whether branches, in particular jmp32 ones, are taken,
    from Yonghong Song.

22) Allow BPF TCP cong-ctls to write app_limited of struct tcp_sock.
    A built-in cc or one from a kernel module is already able to write
    to app_limited, from Yixin Shen.

Conflicts:

Documentation/bpf/bpf_devel_QA.rst
  b7abcd9c ("bpf, doc: Link to submitting-patches.rst for general patch submission info")
  0f10f647 ("bpf, docs: Use internal linking for link to netdev subsystem doc")
https://lore.kernel.org/all/20230307095812.236eb1be@canb.auug.org.au/

include/net/ip_tunnels.h
  bc9d003d ("ip_tunnel: Preserve pointer const in ip_tunnel_info_opts")
  ac931d4c ("ipip,ip_tunnel,sit: Add FOU support for externally controlled ipip devices")
https://lore.kernel.org/all/20230413161235.4093777-1-broonie@kernel.org/

net/bpf/test_run.c
  e5995bc7 ("bpf, test_run: fix crashes due to XDP frame overwriting/corruption")
  294635a8 ("bpf, test_run: fix &xdp_frame misplacement for LIVE_FRAMES")
https://lore.kernel.org/all/20230320102619.05b80a98@canb.auug.org.au/
====================

Link: https://lore.kernel.org/r/20230413191525.7295-1-daniel@iogearbox.net


Signed-off-by: Jakub Kicinski <kuba@kernel.org>
parents 800e68c4 8c5c2a48
+12 −8 Documentation/bpf/bpf_devel_QA.rst
@@ -128,7 +128,8 @@ into the bpf-next tree will make their way into net-next tree. net and
net-next are both run by David S. Miller. From there, they will go
into the kernel mainline tree run by Linus Torvalds. To read up on the
process of net and net-next being merged into the mainline tree, see
-the `netdev-FAQ`_.
+the documentation on netdev subsystem at
+Documentation/process/maintainer-netdev.rst.



@@ -147,7 +148,8 @@ request)::
Q: How do I indicate which tree (bpf vs. bpf-next) my patch should be applied to?
---------------------------------------------------------------------------------

-A: The process is the very same as described in the `netdev-FAQ`_,
+A: The process is the very same as described in the netdev subsystem
+documentation at Documentation/process/maintainer-netdev.rst,
so please read up on it. The subject line must indicate whether the
patch is a fix or rather "next-like" content in order to let the
maintainers know whether it is targeted at bpf or bpf-next.
@@ -206,8 +208,9 @@ ii) run extensive BPF test suite and
Once the BPF pull request was accepted by David S. Miller, then
the patches end up in net or net-next tree, respectively, and
make their way from there further into mainline. Again, see the
-`netdev-FAQ`_ for additional information e.g. on how often they are
-merged to mainline.
+documentation for netdev subsystem at
+Documentation/process/maintainer-netdev.rst for additional information
+e.g. on how often they are merged to mainline.

Q: How long do I need to wait for feedback on my BPF patches?
-------------------------------------------------------------
@@ -230,7 +233,8 @@ Q: Are patches applied to bpf-next when the merge window is open?
-----------------------------------------------------------------
A: For the time when the merge window is open, bpf-next will not be
processed. This is roughly analogous to net-next patch processing,
-so feel free to read up on the `netdev-FAQ`_ about further details.
+so feel free to read up on the netdev docs at
+Documentation/process/maintainer-netdev.rst about further details.

During those two weeks of merge window, we might ask you to resend
your patch series once bpf-next is open again. Once Linus released
@@ -394,7 +398,8 @@ netdev kernel mailing list in Cc and ask for the fix to be queued up:
  netdev@vger.kernel.org

The process in general is the same as on netdev itself, see also the
-`netdev-FAQ`_.
+the documentation on networking subsystem at
+Documentation/process/maintainer-netdev.rst.

Q: Do you also backport to kernels not currently maintained as stable?
----------------------------------------------------------------------
@@ -410,7 +415,7 @@ Q: The BPF patch I am about to submit needs to go to stable as well
What should I do?

A: The same rules apply as with netdev patch submissions in general, see
-the `netdev-FAQ`_.
+the netdev docs at Documentation/process/maintainer-netdev.rst.

Never add "``Cc: stable@vger.kernel.org``" to the patch description, but
ask the BPF maintainers to queue the patches instead. This can be done
@@ -684,7 +689,6 @@ when:


.. Links
-.. _netdev-FAQ: https://www.kernel.org/doc/html/latest/process/maintainer-netdev.html
.. _selftests:
   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/bpf/

+6 −0 Documentation/bpf/clang-notes.rst
@@ -20,6 +20,12 @@ Arithmetic instructions
For CPU versions prior to 3, Clang v7.0 and later can enable ``BPF_ALU`` support with
``-Xclang -target-feature -Xclang +alu32``.  In CPU version 3, support is automatically included.

+Jump instructions
+=================
+
+If ``-O0`` is used, Clang will generate the ``BPF_CALL | BPF_X | BPF_JMP`` (0x8d)
+instruction, which is not supported by the Linux kernel verifier.
+
Atomic operations
=================

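
For illustration, here is a minimal sketch of the Clang usage referenced
above; the file name and the -O2/-mcpu options are ordinary clang usage
supplied for context, while the +alu32 feature flags are the ones quoted
in the note itself::

	/* alu32_demo.bpf.c -- 32-bit arithmetic that can be compiled to
	 * 32-bit BPF_ALU instructions on the sub-registers (w0..w10)
	 * instead of BPF_ALU64 plus explicit masking:
	 *
	 *   clang -O2 -target bpf -mcpu=v2 \
	 *         -Xclang -target-feature -Xclang +alu32 \
	 *         -c alu32_demo.bpf.c -o alu32_demo.bpf.o
	 *
	 * With -mcpu=v3 the 32-bit ALU support is included automatically,
	 * while building with -O0 risks emitting the unsupported
	 * BPF_CALL | BPF_X | BPF_JMP (0x8d) instruction noted above.
	 */
	unsigned int add32(unsigned int a, unsigned int b)
	{
		return a + b;
	}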
+10 −20 Documentation/bpf/cpumasks.rst
@@ -117,12 +117,7 @@ For example:
As mentioned and illustrated above, these ``struct bpf_cpumask *`` objects can
also be stored in a map and used as kptrs. If a ``struct bpf_cpumask *`` is in
a map, the reference can be removed from the map with bpf_kptr_xchg(), or
-opportunistically acquired with bpf_cpumask_kptr_get():
-
-.. kernel-doc:: kernel/bpf/cpumask.c
-  :identifiers: bpf_cpumask_kptr_get
-
-Here is an example of a ``struct bpf_cpumask *`` being retrieved from a map:
+opportunistically acquired using RCU:

.. code-block:: c

@@ -144,7 +139,7 @@ Here is an example of a ``struct bpf_cpumask *`` being retrieved from a map:
	/**
	 * A simple example tracepoint program showing how a
	 * struct bpf_cpumask * kptr that is stored in a map can
-	 * be acquired using the bpf_cpumask_kptr_get() kfunc.
+	 * be passed to kfuncs using RCU protection.
	 */
	SEC("tp_btf/cgroup_mkdir")
	int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path)
@@ -158,26 +153,21 @@ Here is an example of a ``struct bpf_cpumask *`` being retrieved from a map:
		if (!v)
			return -ENOENT;

+		bpf_rcu_read_lock();
		/* Acquire a reference to the bpf_cpumask * kptr that's already stored in the map. */
-		kptr = bpf_cpumask_kptr_get(&v->cpumask);
-		if (!kptr)
+		kptr = v->cpumask;
+		if (!kptr) {
			/* If no bpf_cpumask was present in the map, it's because
			 * we're racing with another CPU that removed it with
			 * bpf_kptr_xchg() between the bpf_map_lookup_elem()
-			 * above, and our call to bpf_cpumask_kptr_get().
-			 * bpf_cpumask_kptr_get() internally safely handles this
-			 * race, and will return NULL if the cpumask is no longer
-			 * present in the map by the time we invoke the kfunc.
+			 * above, and our load of the pointer from the map.
			 */
+			bpf_rcu_read_unlock();
			return -EBUSY;
+		}

-		/* Free the reference we just took above. Note that the
-		 * original struct bpf_cpumask * kptr is still in the map. It will
-		 * be freed either at a later time if another context deletes
-		 * it from the map, or automatically by the BPF subsystem if
-		 * it's still present when the map is destroyed.
-		 */
-		bpf_cpumask_release(kptr);
+		bpf_cpumask_setall(kptr);
+		bpf_rcu_read_unlock();

		return 0;
	}
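
To show the whole RCU-based pattern in context, here is a self-contained
sketch assembled around the hunk above; the map layout mirrors the one used
earlier in the cpumasks document, and the map, struct, and program names
here are illustrative rather than taken from the patch::

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	/* kfuncs used below, declared in the usual extern __ksym style. */
	void bpf_rcu_read_lock(void) __ksym;
	void bpf_rcu_read_unlock(void) __ksym;
	void bpf_cpumask_setall(struct bpf_cpumask *cpumask) __ksym;

	struct cpumasks_kfunc_map_value {
		struct bpf_cpumask __kptr * cpumask;
	};

	struct {
		__uint(type, BPF_MAP_TYPE_HASH);
		__type(key, int);
		__type(value, struct cpumasks_kfunc_map_value);
		__uint(max_entries, 1);
	} cpumasks_kfunc_map SEC(".maps");

	SEC("tp_btf/cgroup_mkdir")
	int BPF_PROG(cgrp_rcu_example, struct cgroup *cgrp, const char *path)
	{
		struct cpumasks_kfunc_map_value *v;
		struct bpf_cpumask *kptr;
		int key = 0;

		v = bpf_map_lookup_elem(&cpumasks_kfunc_map, &key);
		if (!v)
			return -ENOENT;

		/* No reference is taken; the pointer is only valid inside
		 * the RCU read-side critical section.
		 */
		bpf_rcu_read_lock();
		kptr = v->cpumask;
		if (!kptr) {
			/* Racing with a bpf_kptr_xchg() that removed it. */
			bpf_rcu_read_unlock();
			return -EBUSY;
		}

		bpf_cpumask_setall(kptr);
		bpf_rcu_read_unlock();

		return 0;
	}

	char _license[] SEC("license") = "GPL";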
+101 −28 Documentation/bpf/instruction-set.rst
@@ -11,7 +11,8 @@ Documentation conventions
=========================

For brevity, this document uses the type notion "u64", "u32", etc.
-to mean an unsigned integer whose width is the specified number of bits.
+to mean an unsigned integer whose width is the specified number of bits,
+and "s32", etc. to mean a signed integer of the specified number of bits.

Registers and calling convention
================================
@@ -242,28 +243,58 @@ Jump instructions
otherwise identical operations.
The 'code' field encodes the operation as below:

-========  =====  =========================  ============
-code      value  description                notes
-========  =====  =========================  ============
-BPF_JA    0x00   PC += off                  BPF_JMP only
-BPF_JEQ   0x10   PC += off if dst == src
-BPF_JGT   0x20   PC += off if dst > src     unsigned
-BPF_JGE   0x30   PC += off if dst >= src    unsigned
-BPF_JSET  0x40   PC += off if dst & src
-BPF_JNE   0x50   PC += off if dst != src
-BPF_JSGT  0x60   PC += off if dst > src     signed
-BPF_JSGE  0x70   PC += off if dst >= src    signed
-BPF_CALL  0x80   function call
-BPF_EXIT  0x90   function / program return  BPF_JMP only
-BPF_JLT   0xa0   PC += off if dst < src     unsigned
-BPF_JLE   0xb0   PC += off if dst <= src    unsigned
-BPF_JSLT  0xc0   PC += off if dst < src     signed
-BPF_JSLE  0xd0   PC += off if dst <= src    signed
-========  =====  =========================  ============
+========  =====  ===  ===========================================  =========================================
+code      value  src  description                                  notes
+========  =====  ===  ===========================================  =========================================
+BPF_JA    0x0    0x0  PC += offset                                 BPF_JMP only
+BPF_JEQ   0x1    any  PC += offset if dst == src
+BPF_JGT   0x2    any  PC += offset if dst > src                    unsigned
+BPF_JGE   0x3    any  PC += offset if dst >= src                   unsigned
+BPF_JSET  0x4    any  PC += offset if dst & src
+BPF_JNE   0x5    any  PC += offset if dst != src
+BPF_JSGT  0x6    any  PC += offset if dst > src                    signed
+BPF_JSGE  0x7    any  PC += offset if dst >= src                   signed
+BPF_CALL  0x8    0x0  call helper function by address              see `Helper functions`_
+BPF_CALL  0x8    0x1  call PC += offset                            see `Program-local functions`_
+BPF_CALL  0x8    0x2  call helper function by BTF ID               see `Helper functions`_
+BPF_EXIT  0x9    0x0  return                                       BPF_JMP only
+BPF_JLT   0xa    any  PC += offset if dst < src                    unsigned
+BPF_JLE   0xb    any  PC += offset if dst <= src                   unsigned
+BPF_JSLT  0xc    any  PC += offset if dst < src                    signed
+BPF_JSLE  0xd    any  PC += offset if dst <= src                   signed
+========  =====  ===  ===========================================  =========================================

The eBPF program needs to store the return value into register R0 before doing a
-BPF_EXIT.
+``BPF_EXIT``.

+Example:
+
+``BPF_JSGE | BPF_X | BPF_JMP32`` (0x7e) means::
+
+  if (s32)dst s>= (s32)src goto +offset
+
+where 's>=' indicates a signed '>=' comparison.
+
+Helper functions
+~~~~~~~~~~~~~~~~
+
+Helper functions are a concept whereby BPF programs can call into a
+set of function calls exposed by the underlying platform.
+
+Historically, each helper function was identified by an address
+encoded in the imm field.  The available helper functions may differ
+for each program type, but address values are unique across all program types.
+
+Platforms that support the BPF Type Format (BTF) support identifying
+a helper function by a BTF ID encoded in the imm field, where the BTF ID
+identifies the helper name and type.
+
+Program-local functions
+~~~~~~~~~~~~~~~~~~~~~~~
+Program-local functions are functions exposed by the same BPF program as the
+caller, and are referenced by offset from the call instruction, similar to
+``BPF_JA``.  A ``BPF_EXIT`` within the program-local function will return to
+the caller.

Load and store instructions
===========================
@@ -385,14 +416,56 @@ and loaded back to ``R0``.
-----------------------------

Instructions with the ``BPF_IMM`` 'mode' modifier use the wide instruction
-encoding for an extra imm64 value.
-
-There is currently only one such instruction.
-
-``BPF_LD | BPF_DW | BPF_IMM`` means::
-
-  dst = imm64
-
+encoding defined in `Instruction encoding`_, and use the 'src' field of the
+basic instruction to hold an opcode subtype.
+
+The following table defines a set of ``BPF_IMM | BPF_DW | BPF_LD`` instructions
+with opcode subtypes in the 'src' field, using new terms such as "map"
+defined further below:
+
+=========================  ======  ===  =========================================  ===========  ==============
+opcode construction        opcode  src  pseudocode                                 imm type     dst type
+=========================  ======  ===  =========================================  ===========  ==============
+BPF_IMM | BPF_DW | BPF_LD  0x18    0x0  dst = imm64                                integer      integer
+BPF_IMM | BPF_DW | BPF_LD  0x18    0x1  dst = map_by_fd(imm)                       map fd       map
+BPF_IMM | BPF_DW | BPF_LD  0x18    0x2  dst = map_val(map_by_fd(imm)) + next_imm   map fd       data pointer
+BPF_IMM | BPF_DW | BPF_LD  0x18    0x3  dst = var_addr(imm)                        variable id  data pointer
+BPF_IMM | BPF_DW | BPF_LD  0x18    0x4  dst = code_addr(imm)                       integer      code pointer
+BPF_IMM | BPF_DW | BPF_LD  0x18    0x5  dst = map_by_idx(imm)                      map index    map
+BPF_IMM | BPF_DW | BPF_LD  0x18    0x6  dst = map_val(map_by_idx(imm)) + next_imm  map index    data pointer
+=========================  ======  ===  =========================================  ===========  ==============
+
+where
+
+* map_by_fd(imm) means to convert a 32-bit file descriptor into an address of a map (see `Maps`_)
+* map_by_idx(imm) means to convert a 32-bit index into an address of a map
+* map_val(map) gets the address of the first value in a given map
+* var_addr(imm) gets the address of a platform variable (see `Platform Variables`_) with a given id
+* code_addr(imm) gets the address of the instruction at a specified relative offset in number of (64-bit) instructions
+* the 'imm type' can be used by disassemblers for display
+* the 'dst type' can be used for verification and JIT compilation purposes
+
+Maps
+~~~~
+
+Maps are shared memory regions accessible by eBPF programs on some platforms.
+A map can have various semantics as defined in a separate document, and may or
+may not have a single contiguous memory region, but the 'map_val(map)' is
+currently only defined for maps that do have a single contiguous memory region.
+
+Each map can have a file descriptor (fd) if supported by the platform, where
+'map_by_fd(imm)' means to get the map with the specified file descriptor. Each
+BPF program can also be defined to use a set of maps associated with the
+program at load time, and 'map_by_idx(imm)' means to get the map with the given
+index in the set associated with the BPF program containing the instruction.
+
+Platform Variables
+~~~~~~~~~~~~~~~~~~
+
+Platform variables are memory regions, identified by integer ids, exposed by
+the runtime and accessible by BPF programs on some platforms.  The
+'var_addr(imm)' operation means to get the address of the memory region
+identified by the given id.

Legacy BPF Packet access instructions
-------------------------------------
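
To make the new wide-encoding table concrete, here is a minimal sketch of
the two-slot layout using the uapi ``struct bpf_insn``; the register and
immediate values below are arbitrary::

	#include <linux/bpf.h>

	/* dst = imm64 (opcode 0x18, src 0x0) occupies two consecutive
	 * instruction slots: 'imm' of the first slot holds the low 32 bits
	 * and 'imm' of the following slot ("next_imm") the high 32 bits.
	 */
	const struct bpf_insn ld_imm64[2] = {
		{ .code    = BPF_LD | BPF_DW | BPF_IMM,	/* 0x18 */
		  .dst_reg = BPF_REG_1,
		  .src_reg = 0x0,			/* subtype: dst = imm64 */
		  .imm     = (__s32)0x89abcdef },	/* low 32 bits */
		{ .code    = 0,				/* continuation slot */
		  .imm     = 0x01234567 },		/* high 32 bits */
	};

	/* Loads r1 = 0x0123456789abcdef. With .src_reg = 0x1 and a map fd
	 * in 'imm', the same pair would instead mean dst = map_by_fd(imm).
	 */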
+47 −77 Documentation/bpf/kfuncs.rst
@@ -179,9 +179,10 @@ both are orthogonal to each other.
---------------------

The KF_RELEASE flag is used to indicate that the kfunc releases the pointer
-passed in to it. There can be only one referenced pointer that can be passed in.
-All copies of the pointer being released are invalidated as a result of invoking
-kfunc with this flag.
+passed in to it. There can be only one referenced pointer that can be passed
+in. All copies of the pointer being released are invalidated as a result of
+invoking kfunc with this flag. KF_RELEASE kfuncs automatically receive the
+protection afforded by the KF_TRUSTED_ARGS flag described below.

2.4.4 KF_KPTR_GET flag
----------------------
@@ -470,7 +471,7 @@ struct_ops callback arg. For example:
		struct task_struct *acquired;

		acquired = bpf_task_acquire(task);
-
+		if (acquired)
			/*
			 * In a typical program you'd do something like store
			 * the task in a map, and the map will automatically
@@ -480,6 +481,43 @@ struct_ops callback arg. For example:
		return 0;
	}


+References acquired on ``struct task_struct *`` objects are RCU protected.
+Therefore, when in an RCU read region, you can obtain a pointer to a task
+embedded in a map value without having to acquire a reference:
+
+.. code-block:: c
+
+	#define private(name) SEC(".data." #name) __hidden __attribute__((aligned(8)))
+	private(TASK) static struct task_struct *global;
+
+	/**
+	 * A trivial example showing how to access a task stored
+	 * in a map using RCU.
+	 */
+	SEC("tp_btf/task_newtask")
+	int BPF_PROG(task_rcu_read_example, struct task_struct *task, u64 clone_flags)
+	{
+		struct task_struct *local_copy;
+
+		bpf_rcu_read_lock();
+		local_copy = global;
+		if (local_copy)
+			/*
+			 * We could also pass local_copy to kfuncs or helper functions here,
+			 * as we're guaranteed that local_copy will be valid until we exit
+			 * the RCU read region below.
+			 */
+			bpf_printk("Global task %s is valid", local_copy->comm);
+		else
+			bpf_printk("No global task found");
+		bpf_rcu_read_unlock();
+
+		/* At this point we can no longer reference local_copy. */
+
+		return 0;
+	}

----

A BPF program can also look up a task from a pid. This can be useful if the
@@ -534,74 +572,6 @@ bpf_task_release() respectively, so we won't provide examples for them.

----

-You may also acquire a reference to a ``struct cgroup`` kptr that's already
-stored in a map using bpf_cgroup_kptr_get():
-
-.. kernel-doc:: kernel/bpf/helpers.c
-   :identifiers: bpf_cgroup_kptr_get
-
-Here's an example of how it can be used:
-
-.. code-block:: c
-
-	/* struct containing the struct task_struct kptr which is actually stored in the map. */
-	struct __cgroups_kfunc_map_value {
-		struct cgroup __kptr * cgroup;
-	};
-
-	/* The map containing struct __cgroups_kfunc_map_value entries. */
-	struct {
-		__uint(type, BPF_MAP_TYPE_HASH);
-		__type(key, int);
-		__type(value, struct __cgroups_kfunc_map_value);
-		__uint(max_entries, 1);
-	} __cgroups_kfunc_map SEC(".maps");
-
-	/* ... */
-
-	/**
-	 * A simple example tracepoint program showing how a
-	 * struct cgroup kptr that is stored in a map can
-	 * be acquired using the bpf_cgroup_kptr_get() kfunc.
-	 */
-	 SEC("tp_btf/cgroup_mkdir")
-	 int BPF_PROG(cgroup_kptr_get_example, struct cgroup *cgrp, const char *path)
-	 {
-		struct cgroup *kptr;
-		struct __cgroups_kfunc_map_value *v;
-		s32 id = cgrp->self.id;
-
-		/* Assume a cgroup kptr was previously stored in the map. */
-		v = bpf_map_lookup_elem(&__cgroups_kfunc_map, &id);
-		if (!v)
-			return -ENOENT;
-
-		/* Acquire a reference to the cgroup kptr that's already stored in the map. */
-		kptr = bpf_cgroup_kptr_get(&v->cgroup);
-		if (!kptr)
-			/* If no cgroup was present in the map, it's because
-			 * we're racing with another CPU that removed it with
-			 * bpf_kptr_xchg() between the bpf_map_lookup_elem()
-			 * above, and our call to bpf_cgroup_kptr_get().
-			 * bpf_cgroup_kptr_get() internally safely handles this
-			 * race, and will return NULL if the task is no longer
-			 * present in the map by the time we invoke the kfunc.
-			 */
-			return -EBUSY;
-
-		/* Free the reference we just took above. Note that the
-		 * original struct cgroup kptr is still in the map. It will
-		 * be freed either at a later time if another context deletes
-		 * it from the map, or automatically by the BPF subsystem if
-		 * it's still present when the map is destroyed.
-		 */
-		bpf_cgroup_release(kptr);
-
-		return 0;
-        }

----

Other kfuncs available for interacting with ``struct cgroup *`` objects are
bpf_cgroup_ancestor() and bpf_cgroup_from_id(), allowing callers to access
the ancestor of a cgroup and find a cgroup by its ID, respectively. Both