Commit 7e68dd7d authored Dec 13, 2022 by Linus Torvalds

net-next

Pull networking updates from Paolo Abeni:
 "Core:

   - Allow live renaming when an interface is up

   - Add retpoline wrappers for tc, improving considerably the
     performances of complex queue discipline configurations

   - Add inet drop monitor support

   - A few GRO performance improvements

   - Add infrastructure for atomic dev stats, addressing long standing
     data races

   - De-duplicate common code between OVS and conntrack offloading
     infrastructure

   - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements

   - Netfilter: introduce packet parser for tunneled packets

   - Replace IPVS timer-based estimators with kthreads to scale up the
     workload with the number of available CPUs

   - Add the helper support for connection-tracking OVS offload

  BPF:

   - Support for user defined BPF objects: the use case is to allocate
     own objects, build own object hierarchies and use the building
     blocks to build own data structures flexibly, for example, linked
     lists in BPF

   - Make cgroup local storage available to non-cgroup attached BPF
     programs

   - Avoid unnecessary deadlock detection and failures wrt BPF task
     storage helpers

   - A relevant bunch of BPF verifier fixes and improvements

   - Veristat tool improvements to support custom filtering, sorting,
     and replay of results

   - Add LLVM disassembler as default library for dumping JITed code

   - Lots of new BPF documentation for various BPF maps

   - Add bpf_rcu_read_{,un}lock() support for sleepable programs

   - Add RCU grace period chaining to BPF to wait for the completion of
     access from both sleepable and non-sleepable BPF programs

   - Add support storing struct task_struct objects as kptrs in maps

   - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer
     values

   - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions

  Protocols:

   - TCP: implement Protective Load Balancing across switch links

   - TCP: allow dynamically disabling TCP-MD5 static key, reverting back
     to fast[er]-path

   - UDP: Introduce optional per-netns hash lookup table

   - IPv6: simplify and cleanup sockets disposal

   - Netlink: support different type policies for each generic netlink
     operation

   - MPTCP: add MSG_FASTOPEN and FastOpen listener side support

   - MPTCP: add netlink notification support for listener sockets events

   - SCTP: add VRF support, allowing sctp sockets binding to VRF devices

   - Add bridging MAC Authentication Bypass (MAB) support

   - Extensions for Ethernet VPN bridging implementation to better
     support multicast scenarios

   - More work for Wi-Fi 7 support, comprising conversion of all the
     existing drivers to internal TX queue usage

   - IPSec: introduce a new offload type (packet offload) allowing
     complete header processing and crypto offloading

   - IPSec: extended ack support for more descriptive XFRM error
     reporting

   - RXRPC: increase SACK table size and move processing into a
     per-local endpoint kernel thread, reducing considerably the
     required locking

   - IEEE 802154: synchronous send frame and extended filtering support,
     initial support for scanning available 15.4 networks

   - Tun: bump the link speed from 10Mbps to 10Gbps

   - Tun/VirtioNet: implement UDP segmentation offload support

  Driver API:

   - PHY/SFP: improve power level switching between standard level 1 and
     the higher power levels

   - New API for netdev <-> devlink_port linkage

   - PTP: convert existing drivers to new frequency adjustment
     implementation

   - DSA: add support for rx offloading

   - Autoload DSA tagging driver when dynamically changing protocol

   - Add new PCP and APPTRUST attributes to Data Center Bridging

   - Add configuration support for 800Gbps link speed

   - Add devlink port function attribute to enable/disable RoCE and
     migratable

   - Extend devlink-rate to support strict priority and weighted fair
     queuing

   - Add devlink support to directly reading from region memory

   - New device tree helper to fetch MAC address from nvmem

   - New big TCP helper to simplify temporary header stripping

  New hardware / drivers:

   - Ethernet:
      - Marvell Octeon CNF95N and CN10KB Ethernet Switches
      - Marvell Prestera AC5X Ethernet Switch
      - WangXun 10 Gigabit NIC
      - Motorcomm yt8521 Gigabit Ethernet
      - Microchip ksz9563 Gigabit Ethernet Switch
      - Microsoft Azure Network Adapter
      - Linux Automation 10Base-T1L adapter

   - PHY:
      - Aquantia AQR112 and AQR412
      - Motorcomm YT8531S

   - PTP:
      - Orolia ART-CARD

   - WiFi:
      - MediaTek Wi-Fi 7 (802.11be) devices
      - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB
        devices

   - Bluetooth:
      - Broadcom BCM4377/4378/4387 Bluetooth chipsets
      - Realtek RTL8852BE and RTL8723DS
      - Cypress CYW4373A0 WiFi + Bluetooth combo device

  Drivers:

   - CAN:
      - gs_usb: bus error reporting support
      - kvaser_usb: listen only and bus error reporting support

   - Ethernet NICs:
      - Intel (100G):
         - extend action skbedit to RX queue mapping
         - implement devlink-rate support
         - support direct read from memory
      - nVidia/Mellanox (mlx5):
         - SW steering improvements, increasing rules update rate
         - Support for enhanced events compression
         - extend H/W offload packet manipulation capabilities
         - implement IPSec packet offload mode
      - nVidia/Mellanox (mlx4):
         - better big TCP support
      - Netronome Ethernet NICs (nfp):
         - IPsec offload support
         - add support for multicast filter
      - Broadcom:
         - RSS and PTP support improvements
      - AMD/SolarFlare:
         - netlink extended ack improvements
         - add basic flower matches to offload, and related stats
      - Virtual NICs:
         - ibmvnic: introduce affinity hint support
      - small / embedded:
         - FreeScale fec: add initial XDP support
         - Marvell mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood
         - TI am65-cpsw: add suspend/resume support
         - Mediatek MT7986: add RX Wireless Ethernet Dispatch support
         - Realtek 8169: enable GRO software interrupt coalescing per
           default

   - Ethernet high-speed switches:
      - Microchip (sparx5):
         - add support for Sparx5 TC/flower H/W offload via VCAP
      - Mellanox mlxsw:
         - add 802.1X and MAC Authentication Bypass offload support
         - add ip6gre support

   - Embedded Ethernet switches:
      - Mediatek (mtk_eth_soc):
         - improve PCS implementation, add DSA untag support
         - enable flow offload support
      - Renesas:
         - add rswitch R-Car Gen4 gPTP support
      - Microchip (lan966x):
         - add full XDP support
         - add TC H/W offload via VCAP
         - enable PTP on bridge interfaces
      - Microchip (ksz8):
         - add MTU support for KSZ8 series

   - Qualcomm 802.11ax WiFi (ath11k):
      - support configuring channel dwell time during scan

   - MediaTek WiFi (mt76):
      - enable Wireless Ethernet Dispatch (WED) offload support
      - add ack signal support
      - enable coredump support
      - remain_on_channel support

   - Intel WiFi (iwlwifi):
      - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities
      - 320 MHz channels support

   - RealTek WiFi (rtw89):
      - new dynamic header firmware format support
      - wake-over-WLAN support"

* tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2002 commits)
  ipvs: fix type warning in do_div() on 32 bit
  net: lan966x: Remove a useless test in lan966x_ptp_add_trap()
  net: ipa: add IPA v4.7 support
  dt-bindings: net: qcom,ipa: Add SM6350 compatible
  bnxt: Use generic HBH removal helper in tx path
  IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver
  selftests: forwarding: Add bridge MDB test
  selftests: forwarding: Rename bridge_mdb test
  bridge: mcast: Support replacement of MDB port group entries
  bridge: mcast: Allow user space to specify MDB entry routing protocol
  bridge: mcast: Allow user space to add (*, G) with a source list and filter mode
  bridge: mcast: Add support for (*, G) with a source list and filter mode
  bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source
  bridge: mcast: Add a flag for user installed source entries
  bridge: mcast: Expose __br_multicast_del_group_src()
  bridge: mcast: Expose br_multicast_new_group_src()
  bridge: mcast: Add a centralized error path
  bridge: mcast: Place netlink policy before validation functions
  bridge: mcast: Split (*, G) and (S, G) addition into different functions
  bridge: mcast: Do not derive entry type from its filter mode
  ...

parents 1ca06f1c 7c4a6309

Documentation/bpf/bpf_design_QA.rst

+45 −0

		@@ -298,3 +298,48 @@ A: NO.

		The BTF_ID macro does not cause a function to become part of the ABI
		any more than does the EXPORT_SYMBOL_GPL macro.

		Q: What is the compatibility story for special BPF types in map values?
		-----------------------------------------------------------------------
		Q: Users are allowed to embed bpf_spin_lock and bpf_timer fields in their BPF
		map values (when using BTF support for BPF maps). This allows helpers for such
		objects to be used on these fields inside map values. Users are also allowed to
		embed pointers to some kernel types (with __kptr and __kptr_ref BTF tags). Will
		the kernel preserve backwards compatibility for these features?

		A: It depends. For bpf_spin_lock, bpf_timer: YES, for kptr and everything else:
		NO, but see below.

		For struct types that have been added already, like bpf_spin_lock and bpf_timer,
		the kernel will preserve backwards compatibility, as they are part of UAPI.

		For kptrs, they are also part of UAPI, but only with respect to the kptr
		mechanism. The types that you can use with a __kptr and __kptr_ref tagged
		pointer in your struct are NOT part of the UAPI contract. The supported types can
		and will change across kernel releases. However, operations like accessing kptr
		fields and bpf_kptr_xchg() helper will continue to be supported across kernel
		releases for the supported types.

		For any other supported struct type, unless explicitly stated in this document
		and added to bpf.h UAPI header, such types can and will arbitrarily change their
		size, type, and alignment, or any other user visible API or ABI detail across
		kernel releases. The users must adapt their BPF programs to the new changes and
		update them to make sure their programs continue to work correctly.

		NOTE: BPF subsystem specially reserves the 'bpf\_' prefix for type names, in
		order to introduce more special fields in the future. Hence, user programs must
		avoid defining types with 'bpf\_' prefix to not be broken in future releases.
		In other words, no backwards compatibility is guaranteed if one uses a type
		in BTF with 'bpf\_' prefix.

		Q: What is the compatibility story for special BPF types in allocated objects?
		------------------------------------------------------------------------------
		Q: Same as above, but for allocated objects (i.e. objects allocated using
		bpf_obj_new for user defined types). Will the kernel preserve backwards
		compatibility for these features?

		A: NO.

		Unlike map value types, there are no stability guarantees for this case. The
		whole API to work with allocated objects and any support for special fields
		inside them is unstable (since it is exposed through kfuncs).

Documentation/bpf/bpf_devel_QA.rst

+27 −0

		@@ -44,6 +44,33 @@ is a guarantee that the reported issue will be overlooked.**
		Submitting patches
		==================

		Q: How do I run BPF CI on my changes before sending them out for review?
		------------------------------------------------------------------------
		A: BPF CI is GitHub based and hosted at https://github.com/kernel-patches/bpf.
		While GitHub also provides a CLI that can be used to accomplish the same
		results, here we focus on the UI based workflow.

		The following steps lay out how to start a CI run for your patches:

		- Create a fork of the aforementioned repository in your own account (one time
		  action)

		- Clone the fork locally, check out a new branch tracking either the bpf-next
		  or bpf branch, and apply your to-be-tested patches on top of it

		- Push the local branch to your fork and create a pull request against
		  kernel-patches/bpf's bpf-next_base or bpf_base branch, respectively

		Shortly after the pull request has been created, the CI workflow will run. Note
		that capacity is shared with patches submitted upstream being checked and so
		depending on utilization the run can take a while to finish.

		Note furthermore that both base branches (bpf-next_base and bpf_base) will be
		updated as patches are pushed to the respective upstream branches they track. As
		such, an attempt will be made to automatically rebase your patch set as well.
		This behavior can result in a CI run being aborted and restarted with the new
		baseline.

		Q: To which mailing list do I need to submit my BPF patches?
		------------------------------------------------------------
		A: Please submit your BPF patches to the bpf kernel mailing list:

Documentation/bpf/bpf_iterators.rst

0 → 100644

+485 −0

		=============
		BPF Iterators
		=============


		----------
		Motivation
		----------

		There are a few existing ways to dump kernel data into user space. The most
		popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps
		all tcp6 sockets in the system, and ``cat /proc/net/netlink`` dumps all netlink
		sockets in the system. However, their output format tends to be fixed, and if
		users want more information about these sockets, they have to patch the kernel,
		which often takes time to publish upstream and release. The same is true for popular
		tools like `ss <https://man7.org/linux/man-pages/man8/ss.8.html>`_ where any
		additional information needs a kernel patch.

		To solve this problem, the `drgn
		<https://www.kernel.org/doc/html/latest/bpf/drgn.html>`_ tool is often used to
		dig out the kernel data with no kernel change. However, the main drawback for
		drgn is performance, as it cannot do pointer tracing inside the kernel. In
		addition, drgn cannot validate a pointer value and may read invalid data if the
		pointer becomes invalid inside the kernel.

		The BPF iterator solves the above problem by providing flexibility on what data
		(e.g., tasks, bpf_maps, etc.) to collect by calling BPF programs for each kernel
		data object.

		----------------------
		How BPF Iterators Work
		----------------------

		A BPF iterator is a type of BPF program that allows users to iterate over
		specific types of kernel objects. Unlike traditional BPF tracing programs that
		allow users to define callbacks that are invoked at particular points of
		execution in the kernel, BPF iterators allow users to define callbacks that
		should be executed for every entry in a variety of kernel data structures.

		For example, users can define a BPF iterator that iterates over every task on
		the system and dumps the total amount of CPU runtime currently used by each of
		them. Another BPF task iterator may instead dump the cgroup information for each
		task. Such flexibility is the core value of BPF iterators.

		A BPF program is always loaded into the kernel at the behest of a user space
		process. A user space process loads a BPF program by opening and initializing
		the program skeleton as required and then invoking a syscall to have the BPF
		program verified and loaded by the kernel.

		In traditional tracing programs, a program is activated by having user space
		obtain a ``bpf_link`` to the program with ``bpf_program__attach()``. Once
		activated, the program callback will be invoked whenever the tracepoint is
		triggered in the main kernel. For BPF iterator programs, a ``bpf_link`` to the
		program is obtained using ``bpf_link_create()``, and the program callback is
		invoked by issuing system calls from user space.

		Next, let us see how you can use the iterators to iterate on kernel objects and
		read data.

		------------------------
		How to Use BPF iterators
		------------------------

		BPF selftests are a great resource to illustrate how to use the iterators. In
		this section, we’ll walk through a BPF selftest which shows how to load and use
		a BPF iterator program. To begin, we’ll look at `bpf_iter.c
		<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/bpf_iter.c>`_,
		which illustrates how to load and trigger BPF iterators on the user space side.
		Later, we’ll look at a BPF program that runs in kernel space.

		Loading a BPF iterator in the kernel from user space typically involves the
		following steps:

		* The BPF program is loaded into the kernel through ``libbpf``. Once the kernel
		  has verified and loaded the program, it returns a file descriptor (fd) to user
		  space.
		* Obtain a ``link_fd`` to the BPF program by calling ``bpf_link_create()``
		  with the BPF program file descriptor received from the kernel.
		* Next, obtain a BPF iterator file descriptor (``bpf_iter_fd``) by calling
		  ``bpf_iter_create()`` with the ``link_fd`` obtained in the previous step.
		* Trigger the iteration by calling ``read(bpf_iter_fd)`` until no data is
		  available.
		* Close the iterator fd using ``close(bpf_iter_fd)``.
		* To reread the data, get a new ``bpf_iter_fd`` and do the read again.
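
		The read step in the list above can be sketched in plain C as follows. This is
		a minimal sketch rather than code taken from the selftest: the helper name
		``drain_iter_fd()`` is hypothetical, and the caller is assumed to have already
		obtained the fd from ``bpf_iter_create()``. It relies only on ``read()``
		returning 0 once no more data is available.

		::

		  #include <errno.h>
		  #include <stdio.h>
		  #include <unistd.h>

		  /* Hypothetical helper: copy everything the fd produces to 'out'.
		   * Returns the number of bytes read, or -1 if read() fails.
		   */
		  static long drain_iter_fd(int iter_fd, FILE *out)
		  {
		          char buf[4096];
		          long total = 0;
		          ssize_t len;

		          for (;;) {
		                  len = read(iter_fd, buf, sizeof(buf));
		                  if (len == 0)           /* no more data: iteration is done */
		                          break;
		                  if (len < 0) {
		                          if (errno == EINTR)     /* interrupted, try again */
		                                  continue;
		                          return -1;
		                  }
		                  fwrite(buf, 1, (size_t)len, out);
		                  total += len;
		          }
		          return total;
		  }

		With a real iterator, the caller would pass the fd returned by
		``bpf_iter_create()`` and ``close()`` it once the helper returns.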

		The following are a few examples of selftest BPF iterator programs:

		* `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_
		* `bpf_iter_task_vma.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c>`_
		* `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_

		Let us look at ``bpf_iter_task_file.c``, which runs in kernel space:

		Here is the definition of ``bpf_iter__task_file`` in `vmlinux.h
		<https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html#btf>`_.
		Any struct name in ``vmlinux.h`` in the format ``bpf_iter__<iter_name>``
		represents a BPF iterator. The suffix ``<iter_name>`` represents the type of
		iterator.

		::

		  struct bpf_iter__task_file {
		          union {
		                  struct bpf_iter_meta *meta;
		          };
		          union {
		                  struct task_struct *task;
		          };
		          u32 fd;
		          union {
		                  struct file *file;
		          };
		  };

		In the above code, the field 'meta' contains the metadata, which is the same for
		all BPF iterator programs. The rest of the fields are specific to different
		iterators. For example, for task_file iterators, the kernel layer provides the
		'task', 'fd' and 'file' field values. The 'task' and 'file' are `reference
		counted
		<https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#file-descriptors-and-reference-counters>`_,
		so they won't go away when the BPF program runs.

		Here is a snippet from the ``bpf_iter_task_file.c`` file:

		::

		  /* count, tgid, last_tgid and unique_tgid_count are global variables
		   * defined at file scope in the selftest.
		   */
		  SEC("iter/task_file")
		  int dump_task_file(struct bpf_iter__task_file *ctx)
		  {
		          struct seq_file *seq = ctx->meta->seq;
		          struct task_struct *task = ctx->task;
		          struct file *file = ctx->file;
		          __u32 fd = ctx->fd;

		          if (task == NULL || file == NULL)
		                  return 0;

		          if (ctx->meta->seq_num == 0) {
		                  count = 0;
		                  BPF_SEQ_PRINTF(seq, " tgid gid fd file\n");
		          }

		          if (tgid == task->tgid && task->tgid != task->pid)
		                  count++;

		          if (last_tgid != task->tgid) {
		                  last_tgid = task->tgid;
		                  unique_tgid_count++;
		          }

		          BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
		                         (long)file->f_op);
		          return 0;
		  }

		In the above example, the section name ``SEC("iter/task_file")`` indicates that
		the program is a BPF iterator program to iterate all files from all tasks. The
		context of the program is the ``bpf_iter__task_file`` struct.

		The user space program invokes the BPF iterator program running in the kernel
		by issuing a ``read()`` syscall. Once invoked, the BPF
		program can export data to user space using a variety of BPF helper functions.
		You can use either ``bpf_seq_printf()`` (and BPF_SEQ_PRINTF helper macro) or
		``bpf_seq_write()`` function based on whether you need formatted output or just
		binary data, respectively. For binary-encoded data, the user space applications
		can process the data from ``bpf_seq_write()`` as needed. For the formatted data,
		you can use ``cat <path>`` to print the results similar to ``cat
		/proc/net/netlink`` after pinning the BPF iterator to the bpffs mount. Later,
		use ``rm -f <path>`` to remove the pinned iterator.
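
		As an illustration of the binary path, the sketch below decodes a buffer of
		fixed-size records in user space. The record layout ``struct file_rec`` and
		the helper ``decode_file_recs()`` are hypothetical: they assume that the BPF
		program emits each record with one ``bpf_seq_write()`` call on a matching
		struct, which is a choice made for this example and not something the
		selftests do.

		::

		  #include <stddef.h>
		  #include <stdint.h>
		  #include <string.h>

		  /* Hypothetical record layout; the BPF side must write the same struct. */
		  struct file_rec {
		          uint32_t tgid;
		          uint32_t pid;
		          uint32_t fd;
		  };

		  /* Decode up to 'max' whole records from 'buf' into 'out'.
		   * Returns the number of records decoded. A trailing partial record
		   * is ignored so the caller can complete it with the next read().
		   */
		  static size_t decode_file_recs(const void *buf, size_t len,
		                                 struct file_rec *out, size_t max)
		  {
		          size_t n = len / sizeof(*out);

		          if (n > max)
		                  n = max;
		          /* memcpy avoids unaligned access when 'buf' is a char array */
		          memcpy(out, buf, n * sizeof(*out));
		          return n;
		  }

		A fixed-size layout keeps the decoder trivial. Variable-length fields would
		need a length prefix or similar framing agreed on by both sides.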

		For example, you can use the following command to create a BPF iterator from the
		``bpf_iter_ipv6_route.o`` object file and pin it to the ``/sys/fs/bpf/my_route``
		path:

		::

		  $ bpftool iter pin ./bpf_iter_ipv6_route.o /sys/fs/bpf/my_route

		And then print out the results using the following command:

		::

		  $ cat /sys/fs/bpf/my_route


		-------------------------------------------------------
		Implement Kernel Support for BPF Iterator Program Types
		-------------------------------------------------------

		To implement a BPF iterator in the kernel, the developer must make a one-time
		change to the following key data structure defined in the `bpf.h
		<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/include/linux/bpf.h>`_
		file.

		::

		  struct bpf_iter_reg {
		          const char *target;
		          bpf_iter_attach_target_t attach_target;
		          bpf_iter_detach_target_t detach_target;
		          bpf_iter_show_fdinfo_t show_fdinfo;
		          bpf_iter_fill_link_info_t fill_link_info;
		          bpf_iter_get_func_proto_t get_func_proto;
		          u32 ctx_arg_info_size;
		          u32 feature;
		          struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX];
		          const struct bpf_iter_seq_info *seq_info;
		  };

		After filling the data structure fields, call ``bpf_iter_reg_target()`` to
		register the iterator to the main BPF iterator subsystem.

		The following is the breakdown for each field in struct ``bpf_iter_reg``.

		.. list-table::
		   :widths: 25 50
		   :header-rows: 1

		   * - Fields
		     - Description
		   * - target
		     - Specifies the name of the BPF iterator. For example: ``bpf_map``,
		       ``bpf_map_elem``. The name should be different from other ``bpf_iter``
		       target names in the kernel.
		   * - attach_target and detach_target
		     - Allows for target specific ``link_create`` action since some targets
		       may need special processing. Called during the user space link_create
		       stage.
		   * - show_fdinfo and fill_link_info
		     - Called to fill target specific information when user tries to get link
		       info associated with the iterator.
		   * - get_func_proto
		     - Permits a BPF iterator to access BPF helpers specific to the iterator.
		   * - ctx_arg_info_size and ctx_arg_info
		     - Specifies the verifier states for BPF program arguments associated with
		       the bpf iterator.
		   * - feature
		     - Specifies certain action requests in the kernel BPF iterator
		       infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means
		       that the kernel function cond_resched() is called to avoid other kernel
		       subsystem (e.g., rcu) misbehaving.
		   * - seq_info
		     - Specifies the set of seq operations for the BPF iterator and helpers to
		       initialize/free the private data for the corresponding ``seq_file``.


		`Click here
		<https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com/>`_
		to see an implementation of the ``task_vma`` BPF iterator in the kernel.

		---------------------------------
		Parameterizing BPF Task Iterators
		---------------------------------

		By default, BPF iterators walk through all the objects of the specified types
		(processes, cgroups, maps, etc.) across the entire system to read relevant
		kernel data. But often, there are cases where we only care about a much smaller
		subset of iterable kernel objects, such as only iterating tasks within a
		specific process. Therefore, BPF iterator programs support filtering out objects
		from iteration by allowing user space to configure the iterator program when it
		is attached.

		--------------------------
		BPF Task Iterator Program
		--------------------------

		The following code is a BPF iterator program to print files and task information
		through the ``seq_file`` of the iterator. It is a standard BPF iterator program
		that visits every file of an iterator. We will use this BPF program in our
		example later.

		::

		  #include <vmlinux.h>
		  #include <bpf/bpf_helpers.h>
		  #include <bpf/bpf_tracing.h>    /* provides BPF_SEQ_PRINTF() */

		  char _license[] SEC("license") = "GPL";

		  SEC("iter/task_file")
		  int dump_task_file(struct bpf_iter__task_file *ctx)
		  {
		          struct seq_file *seq = ctx->meta->seq;
		          struct task_struct *task = ctx->task;
		          struct file *file = ctx->file;
		          __u32 fd = ctx->fd;

		          /* Both pointers are NULL on the final call, after the last file */
		          if (task == NULL || file == NULL)
		                  return 0;
		          /* Print the header once, before the first entry */
		          if (ctx->meta->seq_num == 0) {
		                  BPF_SEQ_PRINTF(seq, " tgid pid fd file\n");
		          }
		          BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
		                         (long)file->f_op);
		          return 0;
		  }

		----------------------------------------
		Creating a File Iterator with Parameters
		----------------------------------------

		Now, let us look at how to create an iterator that includes only files of a
		process.

		First, fill the ``bpf_iter_attach_opts`` struct as shown below:

		::

		  LIBBPF_OPTS(bpf_iter_attach_opts, opts);
		  union bpf_iter_link_info linfo;
		  memset(&linfo, 0, sizeof(linfo));
		  linfo.task.pid = getpid();
		  opts.link_info = &linfo;
		  opts.link_info_len = sizeof(linfo);

		``linfo.task.pid``, if it is non-zero, directs the kernel to create an iterator
		that only includes opened files for the process with the specified ``pid``. In
		this example, we will only be iterating files for our process. If
		``linfo.task.pid`` is zero, the iterator will visit every opened file of every
		process. Similarly, ``linfo.task.tid`` directs the kernel to create an iterator
		that visits opened files of a specific thread, not a process. In this example,
		``linfo.task.tid`` is different from ``linfo.task.pid`` only if the thread has a
		separate file descriptor table. In most circumstances, all process threads share
		a single file descriptor table.

		Now, in the user space program, pass a pointer to the struct to
		``bpf_program__attach_iter()``.

		::

		  link = bpf_program__attach_iter(prog, &opts);
		  iter_fd = bpf_iter_create(bpf_link__fd(link));

		If both tid and pid are zero, an iterator created from this struct
		``bpf_iter_attach_opts`` will include every opened file of every task in the
		system (more precisely, in the current pid namespace). It is the same as passing
		NULL as the second argument to ``bpf_program__attach_iter()``.

		The whole program looks like the following code:

		::

		  #include <stdio.h>
		  #include <string.h>     /* memset() */
		  #include <unistd.h>
		  #include <bpf/bpf.h>
		  #include <bpf/libbpf.h>
		  #include "bpf_iter_task_ex.skel.h"

		  static int do_read_opts(struct bpf_program *prog, struct bpf_iter_attach_opts *opts)
		  {
		          struct bpf_link *link;
		          char buf[16] = {};
		          int iter_fd = -1, len;
		          int ret = 0;

		          link = bpf_program__attach_iter(prog, opts);
		          if (!link) {
		                  fprintf(stderr, "bpf_program__attach_iter() fails\n");
		                  return -1;
		          }
		          iter_fd = bpf_iter_create(bpf_link__fd(link));
		          if (iter_fd < 0) {
		                  fprintf(stderr, "bpf_iter_create() fails\n");
		                  ret = -1;
		                  goto free_link;
		          }
		          /* print the output; read() returns 0 once the iteration is complete */
		          while ((len = read(iter_fd, buf, sizeof(buf) - 1)) > 0) {
		                  buf[len] = 0;
		                  printf("%s", buf);
		          }
		          printf("\n");
		  free_link:
		          if (iter_fd >= 0)
		                  close(iter_fd);
		          bpf_link__destroy(link);
		          return ret;
		  }

		  static void test_task_file(void)
		  {
		          LIBBPF_OPTS(bpf_iter_attach_opts, opts);
		          struct bpf_iter_task_ex *skel;
		          union bpf_iter_link_info linfo;

		          skel = bpf_iter_task_ex__open_and_load();
		          if (skel == NULL)
		                  return;
		          /* restrict the iterator to the files of this process */
		          memset(&linfo, 0, sizeof(linfo));
		          linfo.task.pid = getpid();
		          opts.link_info = &linfo;
		          opts.link_info_len = sizeof(linfo);
		          printf("PID %d\n", getpid());
		          do_read_opts(skel->progs.dump_task_file, &opts);
		          bpf_iter_task_ex__destroy(skel);
		  }

		  int main(int argc, const char * const * argv)
		  {
		          test_task_file();
		          return 0;
		  }

		The following lines are the output of the program.

		::

		  PID 1859

		  tgid pid fd file
		  1859 1859 0 ffffffff82270aa0
		  1859 1859 1 ffffffff82270aa0
		  1859 1859 2 ffffffff82270aa0
		  1859 1859 3 ffffffff82272980
		  1859 1859 4 ffffffff8225e120
		  1859 1859 5 ffffffff82255120
		  1859 1859 6 ffffffff82254f00
		  1859 1859 7 ffffffff82254d80
		  1859 1859 8 ffffffff8225abe0
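
		If a user space tool needs to consume this text output, each data line can be
		parsed back into its fields. The following is a minimal sketch: the names
		``struct task_file_line`` and ``parse_task_file_line()`` are hypothetical, and
		the sketch assumes the line format produced by the ``BPF_SEQ_PRINTF()`` call
		in the BPF program above (three decimal fields followed by a hexadecimal
		value).

		::

		  #include <stdio.h>

		  /* Hypothetical container for one line of iterator output */
		  struct task_file_line {
		          int tgid;
		          int pid;
		          int fd;
		          unsigned long long f_op;        /* kernel address printed in hex */
		  };

		  /* Returns 0 on success, -1 if 'line' is not a data line (e.g. the header) */
		  static int parse_task_file_line(const char *line, struct task_file_line *out)
		  {
		          int n = sscanf(line, "%d %d %d %llx",
		                         &out->tgid, &out->pid, &out->fd, &out->f_op);

		          return n == 4 ? 0 : -1;
		  }

		Because the header line does not start with a number, the parser rejects it,
		so a caller can feed every line through the same function and skip failures.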

		------------------
		Without Parameters
		------------------

		Let us look at how a BPF iterator without parameters skips files of other
		processes in the system. In this case, the BPF program has to check the pid or
		the tid of tasks, or it will receive every opened file in the system (in the
		current pid namespace, actually). So, we usually add a global variable in the
		BPF program to pass a pid to the BPF program.

		The BPF program would look like the following block.

		::

		  ......
		  int target_pid = 0;

		  SEC("iter/task_file")
		  int dump_task_file(struct bpf_iter__task_file *ctx)
		  {
		          ......
		          if (task->tgid != target_pid) /* Check task->pid instead to check thread IDs */
		                  return 0;
		          BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
		                         (long)file->f_op);
		          return 0;
		  }

		The user space program would look like the following block:

		::

		  ......
		  static void test_task_file(void)
		  {
		          ......
		          skel = bpf_iter_task_ex__open_and_load();
		          if (skel == NULL)
		                  return;
		          skel->bss->target_pid = getpid(); /* process ID. For thread id, use gettid() */
		          memset(&linfo, 0, sizeof(linfo));
		          linfo.task.pid = getpid();
		          opts.link_info = &linfo;
		          opts.link_info_len = sizeof(linfo);
		          ......
		  }

		``target_pid`` is a global variable in the BPF program. The user space program
		should initialize the variable with a process ID to skip opened files of other
		processes in the BPF program. When you parametrize a BPF iterator, the iterator
		calls the BPF program fewer times which can save significant resources.

		---------------------------
		Parametrizing VMA Iterators
		---------------------------

		By default, a BPF VMA iterator includes every VMA in every process. However,
		you can still specify a process or a thread to include only its VMAs. Unlike
		files, a thread cannot have a separate address space (since Linux 2.6.0-test6).
		Here, using tid makes no difference from using pid.

		----------------------------
		Parametrizing Task Iterators
		----------------------------

		A BPF task iterator with pid includes all tasks (threads) of a process. The
		BPF program receives these tasks one after another. You can specify a BPF task
		iterator with tid parameter to include only the tasks that match the given
		tid.

Documentation/bpf/btf.rst

+6 −1

		@@ -1062,4 +1062,9 @@ format.::
		7. Testing
		==========

		Kernel bpf selftest `test_btf.c` provides extensive set of BTF-related tests.
		The kernel BPF selftest `tools/testing/selftests/bpf/prog_tests/btf.c`_
		provides an extensive set of BTF-related tests.

		.. Links
		.. _tools/testing/selftests/bpf/prog_tests/btf.c:
		   https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/prog_tests/btf.c

Documentation/bpf/index.rst

+2 −0

		@@ -24,11 +24,13 @@ that goes into great technical depth about the BPF Architecture.
		maps
		bpf_prog_run
		classic_vs_extended.rst
		bpf_iterators
		bpf_licensing
		test_debug
		clang-notes
		linux-notes
		other
		redirect

		.. only:: subproject and html