Commit 0326074f authored by Linus Torvalds's avatar Linus Torvalds
Browse files
Pull networking updates from Jakub Kicinski:
 "Core:

   - Introduce and use a single page frag cache for allocating small skb
     heads, clawing back the 10-20% performance regression in UDP flood
     test from previous fixes.

   - Run packets which already went thru HW coalescing thru SW GRO. This
     significantly improves TCP segment coalescing and simplifies
     deployments as different workloads benefit from HW or SW GRO.

   - Shrink the size of the base zero-copy send structure.

   - Move TCP init under a new slow / sleepable version of DO_ONCE().

  BPF:

   - Add BPF-specific, any-context-safe memory allocator.

   - Add helpers/kfuncs for PKCS#7 signature verification from BPF
     programs.

   - Define a new map type and related helpers for user space -> kernel
     communication over a ring buffer (BPF_MAP_TYPE_USER_RINGBUF).

   - Allow targeting BPF iterators to loop through resources of one
     task/thread.

   - Add ability to call selected destructive functions. Expose
     crash_kexec() to allow BPF to trigger a kernel dump. Use
     CAP_SYS_BOOT check on the loading process to judge permissions.

   - Enable BPF to collect custom hierarchical cgroup stats efficiently
     by integrating with the rstat framework.

   - Support struct arguments for trampoline based programs. Only
     structs with size <= 16B and x86 are supported.

   - Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping
     sockets (instead of just TCP and UDP sockets).

   - Add a helper for accessing CLOCK_TAI for time sensitive network
     related programs.

   - Support accessing network tunnel metadata's flags.

   - Make TCP SYN ACK RTO tunable by BPF programs with TCP Fast Open.

   - Add support for writing to Netfilter's nf_conn:mark.

  Protocols:

   - WiFi: more Extremely High Throughput (EHT) and Multi-Link Operation
     (MLO) work (802.11be, WiFi 7).

   - vsock: improve support for SO_RCVLOWAT.

   - SMC: support SO_REUSEPORT.

   - Netlink: define and document how to use netlink in a "modern" way.
     Support reporting missing attributes via extended ACK.

   - IPSec: support collect metadata mode for xfrm interfaces.

   - TCPv6: send consistent autoflowlabel in SYN_RECV state and RST
     packets.

   - TCP: introduce optional per-netns connection hash table to allow
     better isolation between namespaces (opt-in, at the cost of memory
     and cache pressure).

   - MPTCP: support TCP_FASTOPEN_CONNECT.

   - Add NEXT-C-SID support in Segment Routing (SRv6) End behavior.

   - Adjust IP_UNICAST_IF sockopt behavior for connected UDP sockets.

   - Open vSwitch:
      - Allow specifying ifindex of new interfaces.
      - Allow conntrack and metering in non-initial user namespace.

   - TLS: support the Korean ARIA-GCM crypto algorithm.

   - Remove DECnet support.

  Driver API:

   - Allow selecting the conduit interface used by each port in DSA
     switches, at runtime.

   - Ethernet Power Sourcing Equipment and Power Device support.

   - Add tc-taprio support for queueMaxSDU parameter, i.e. setting per
     traffic class max frame size for time-based packet schedules.

   - Support PHY rate matching - adapting between differing host-side
     and link-side speeds.

   - Introduce QUSGMII PHY mode and 1000BASE-KX interface mode.

   - Validate OF (device tree) nodes for DSA shared ports; make
     phylink-related properties mandatory on DSA and CPU ports.
     Enforcing more uniformity should allow transitioning to phylink.

   - Require that flash component name used during update matches one of
     the components for which version is reported by info_get().

   - Remove "weight" argument from driver-facing NAPI API as much as
     possible. It's one of those magic knobs which seemed like a good
     idea at the time but is too indirect to use in practice.

   - Support offload of TLS connections with 256 bit keys.

  New hardware / drivers:

   - Ethernet:
      - Microchip KSZ9896 6-port Gigabit Ethernet Switch
      - Renesas Ethernet AVB (EtherAVB-IF) Gen4 SoCs
      - Analog Devices ADIN1110 and ADIN2111 industrial single pair
        Ethernet (10BASE-T1L) MAC+PHY.
      - Rockchip RV1126 Gigabit Ethernet (a version of stmmac IP).

   - Ethernet SFPs / modules:
      - RollBall / Hilink / Turris 10G copper SFPs
      - HALNy GPON module

   - WiFi:
      - CYW43439 SDIO chipset (brcmfmac)
      - CYW89459 PCIe chipset (brcmfmac)
      - BCM4378 on Apple platforms (brcmfmac)

  Drivers:

   - CAN:
      - gs_usb: HW timestamp support

   - Ethernet PHYs:
      - lan8814: cable diagnostics

   - Ethernet NICs:
      - Intel (100G):
         - implement control of FCS/CRC stripping
         - port splitting via devlink
         - L2TPv3 filtering offload
      - nVidia/Mellanox:
         - tunnel offload for sub-functions
         - MACSec offload, w/ Extended packet number and replay window
           offload
         - significantly restructure, and optimize the AF_XDP support,
           align the behavior with other vendors
      - Huawei:
         - configuring DSCP map for traffic class selection
         - querying standard FEC statistics
         - querying SerDes lane number via ethtool
      - Marvell/Cavium:
         - egress priority flow control
         - MACSec offload
      - AMD/SolarFlare:
         - PTP over IPv6 and raw Ethernet
      - small / embedded:
         - ax88772: convert to phylink (to support SFP cages)
         - altera: tse: convert to phylink
         - ftgmac100: support fixed link
         - enetc: standard Ethtool counters
         - macb: ZynqMP SGMII dynamic configuration support
         - tsnep: support multi-queue and use page pool
         - lan743x: Rx IP & TCP checksum offload
         - igc: add xdp frags support to ndo_xdp_xmit

   - Ethernet high-speed switches:
      - Marvell (prestera):
         - support SPAN port features (traffic mirroring)
         - nexthop object offloading
      - Microchip (sparx5):
         - multicast forwarding offload
         - QoS queuing offload (tc-mqprio, tc-tbf, tc-ets)

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - support RGMII cmode
      - NXP (felix):
         - standardized ethtool counters
      - Microchip (lan966x):
         - QoS queuing offload (tc-mqprio, tc-tbf, tc-cbs, tc-ets)
         - traffic policing and mirroring
         - link aggregation / bonding offload
         - QUSGMII PHY mode support

   - Qualcomm 802.11ax WiFi (ath11k):
      - cold boot calibration support on WCN6750
      - support to connect to a non-transmit MBSSID AP profile
      - enable remain-on-channel support on WCN6750
      - Wake-on-WLAN support for WCN6750
      - support to provide transmit power from firmware via nl80211
      - support to get power save duration for each client
      - spectral scan support for 160 MHz

   - MediaTek WiFi (mt76):
      - WiFi-to-Ethernet bridging offload for MT7986 chips

   - RealTek WiFi (rtw89):
      - P2P support"

* tag 'net-next-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1864 commits)
  eth: pse: add missing static inlines
  once: rename _SLOW to _SLEEPABLE
  net: pse-pd: add regulator based PSE driver
  dt-bindings: net: pse-dt: add bindings for regulator based PoDL PSE controller
  ethtool: add interface to interact with Ethernet Power Equipment
  net: mdiobus: search for PSE nodes by parsing PHY nodes.
  net: mdiobus: fwnode_mdiobus_register_phy() rework error handling
  net: add framework to support Ethernet PSE and PDs devices
  dt-bindings: net: phy: add PoDL PSE property
  net: marvell: prestera: Propagate nh state from hw to kernel
  net: marvell: prestera: Add neighbour cache accounting
  net: marvell: prestera: add stub handler neighbour events
  net: marvell: prestera: Add heplers to interact with fib_notifier_info
  net: marvell: prestera: Add length macros for prestera_ip_addr
  net: marvell: prestera: add delayed wq and flush wq on deinit
  net: marvell: prestera: Add strict cleanup of fib arbiter
  net: marvell: prestera: Add cleanup of allocated fib_nodes
  net: marvell: prestera: Add router nexthops ABI
  eth: octeon: fix build after netif_napi_add() changes
  net/mlx5: E-Switch, Return EBUSY if can't get mode lock
  ...
parents 522667b2 681bf011
Loading
Loading
Loading
Loading
+0 −4
Original line number Diff line number Diff line
@@ -966,10 +966,6 @@

	debugpat	[X86] Enable PAT debugging

	decnet.addr=	[HW,NET]
			Format: <area>[,<node>]
			See also Documentation/networking/decnet.rst.

	default_hugepagesz=
			[HW] The size of the default HugeTLB page. This is
			the size represented by the legacy /proc/ hugepages
+13 −9
Original line number Diff line number Diff line
@@ -31,17 +31,18 @@ see only some of them, depending on your kernel's configuration.

Table : Subdirectories in /proc/sys/net

 ========= =================== = ========== ==================
 ========= =================== = ========== ===================
 Directory Content               Directory  Content
 ========= =================== = ========== ==================
 core      General parameter     appletalk  Appletalk protocol
 unix      Unix domain sockets   netrom     NET/ROM
 802       E802 protocol         ax25       AX25
 ethernet  Ethernet protocol     rose       X.25 PLP layer
 ========= =================== = ========== ===================
 802       E802 protocol         mptcp      Multipath TCP
 appletalk Appletalk protocol    netfilter  Network Filter
 ax25      AX25                  netrom     NET/ROM
 bridge    Bridging              rose       X.25 PLP layer
 core      General parameter     tipc       TIPC
 ethernet  Ethernet protocol     unix       Unix domain sockets
 ipv4      IP version 4          x25        X.25 protocol
 bridge    Bridging              decnet     DEC net
 ipv6      IP version 6          tipc       TIPC
 ========= =================== = ========== ==================
 ipv6      IP version 6
 ========= =================== = ========== ===================

1. /proc/sys/net/core - Network core options
============================================
@@ -101,6 +102,9 @@ Values:
	- 1 - enable JIT hardening for unprivileged users only
	- 2 - enable JIT hardening for all users

where "privileged user" in this context means a process having
CAP_BPF or CAP_SYS_ADMIN in the root user name space.

bpf_jit_kallsyms
----------------

+30 −0
Original line number Diff line number Diff line
.. contents::
.. sectnum::

==========================
Clang implementation notes
==========================

This document provides more details specific to the Clang/LLVM implementation of the eBPF instruction set.

Versions
========

Clang defined "CPU" versions, where a CPU version of 3 corresponds to the current eBPF ISA.

Clang can select the eBPF ISA version using ``-mcpu=v3`` for example to select version 3.

Arithmetic instructions
=======================

For CPU versions prior to 3, Clang v7.0 and later can enable ``BPF_ALU`` support with
``-Xclang -target-feature -Xclang +alu32``.  In CPU version 3, support is automatically included.

Atomic operations
=================

Clang can generate atomic instructions by default when ``-mcpu=v3`` is
enabled. If a lower version for ``-mcpu`` is set, the only atomic instruction
Clang can generate is ``BPF_ADD`` *without* ``BPF_FETCH``. If you need to enable
the atomics features, while keeping a lower ``-mcpu`` version, you can use
``-Xclang -target-feature -Xclang +alu32``.
+2 −0
Original line number Diff line number Diff line
@@ -26,6 +26,8 @@ that goes into great technical depth about the BPF Architecture.
   classic_vs_extended.rst
   bpf_licensing
   test_debug
   clang-notes
   linux-notes
   other

.. only::  subproject and html
+139 −177
Original line number Diff line number Diff line
.. contents::
.. sectnum::

========================================
eBPF Instruction Set Specification, v1.0
========================================

This document specifies version 1.0 of the eBPF instruction set.

====================
eBPF Instruction Set
====================

Registers and calling convention
================================
@@ -44,24 +49,24 @@ Instruction classes

The three LSB bits of the 'opcode' field store the instruction class:

  =========  =====  ===============================
  class      value  description
  =========  =====  ===============================
  BPF_LD     0x00   non-standard load operations
  BPF_LDX    0x01   load into register operations
  BPF_ST     0x02   store from immediate operations
  BPF_STX    0x03   store from register operations
  BPF_ALU    0x04   32-bit arithmetic operations
  BPF_JMP    0x05   64-bit jump operations
  BPF_JMP32  0x06   32-bit jump operations
  BPF_ALU64  0x07   64-bit arithmetic operations
  =========  =====  ===============================
=========  =====  ===============================  ===================================
class      value  description                      reference
=========  =====  ===============================  ===================================
BPF_LD     0x00   non-standard load operations     `Load and store instructions`_
BPF_LDX    0x01   load into register operations    `Load and store instructions`_
BPF_ST     0x02   store from immediate operations  `Load and store instructions`_
BPF_STX    0x03   store from register operations   `Load and store instructions`_
BPF_ALU    0x04   32-bit arithmetic operations     `Arithmetic and jump instructions`_
BPF_JMP    0x05   64-bit jump operations           `Arithmetic and jump instructions`_
BPF_JMP32  0x06   32-bit jump operations           `Arithmetic and jump instructions`_
BPF_ALU64  0x07   64-bit arithmetic operations     `Arithmetic and jump instructions`_
=========  =====  ===============================  ===================================

Arithmetic and jump instructions
================================

For arithmetic and jump instructions (BPF_ALU, BPF_ALU64, BPF_JMP and
BPF_JMP32), the 8-bit 'opcode' field is divided into three parts:
For arithmetic and jump instructions (``BPF_ALU``, ``BPF_ALU64``, ``BPF_JMP`` and
``BPF_JMP32``), the 8-bit 'opcode' field is divided into three parts:

==============  ======  =================
4 bits (MSB)    1 bit   3 bits (LSB)
@@ -84,13 +89,13 @@ The four MSB bits store the operation code.
Arithmetic instructions
-----------------------

BPF_ALU uses 32-bit wide operands while BPF_ALU64 uses 64-bit wide operands for
``BPF_ALU`` uses 32-bit wide operands while ``BPF_ALU64`` uses 64-bit wide operands for
otherwise identical operations.
The code field encodes the operation as below:
The 'code' field encodes the operation as below:

  ========  =====  =================================================
========  =====  ==========================================================
code      value  description
  ========  =====  =================================================
========  =====  ==========================================================
BPF_ADD   0x00   dst += src
BPF_SUB   0x10   dst -= src
BPF_MUL   0x20   dst \*= src
@@ -104,31 +109,31 @@ The code field encodes the operation as below:
BPF_XOR   0xa0   dst ^= src
BPF_MOV   0xb0   dst = src
BPF_ARSH  0xc0   sign extending shift right
  BPF_END   0xd0   byte swap operations (see separate section below)
  ========  =====  =================================================
BPF_END   0xd0   byte swap operations (see `Byte swap instructions`_ below)
========  =====  ==========================================================

BPF_ADD | BPF_X | BPF_ALU means::
``BPF_ADD | BPF_X | BPF_ALU`` means::

  dst_reg = (u32) dst_reg + (u32) src_reg;

BPF_ADD | BPF_X | BPF_ALU64 means::
``BPF_ADD | BPF_X | BPF_ALU64`` means::

  dst_reg = dst_reg + src_reg

BPF_XOR | BPF_K | BPF_ALU means::
``BPF_XOR | BPF_K | BPF_ALU`` means::

  src_reg = (u32) src_reg ^ (u32) imm32

BPF_XOR | BPF_K | BPF_ALU64 means::
``BPF_XOR | BPF_K | BPF_ALU64`` means::

  src_reg = src_reg ^ imm32


Byte swap instructions
----------------------
~~~~~~~~~~~~~~~~~~~~~~

The byte swap instructions use an instruction class of ``BPF_ALU`` and a 4-bit
code field of ``BPF_END``.
'code' field of ``BPF_END``.

The byte swap instructions operate on the destination register
only and do not use a separate source register or immediate value.
@@ -143,7 +148,7 @@ order the operation convert from or to:
BPF_TO_BE  0x08   convert between host byte order and big endian
=========  =====  =================================================

The imm field encodes the width of the swap operations.  The following widths
The 'imm' field encodes the width of the swap operations.  The following widths
are supported: 16, 32 and 64.

Examples:
@@ -156,16 +161,12 @@ Examples:

  dst_reg = htobe64(dst_reg)

``BPF_FROM_LE`` and ``BPF_FROM_BE`` exist as aliases for ``BPF_TO_LE`` and
``BPF_TO_BE`` respectively.


Jump instructions
-----------------

BPF_JMP32 uses 32-bit wide operands while BPF_JMP uses 64-bit wide operands for
``BPF_JMP32`` uses 32-bit wide operands while ``BPF_JMP`` uses 64-bit wide operands for
otherwise identical operations.
The code field encodes the operation as below:
The 'code' field encodes the operation as below:

========  =====  =========================  ============
code      value  description                notes
@@ -193,7 +194,7 @@ BPF_EXIT.
Load and store instructions
===========================

For load and store instructions (BPF_LD, BPF_LDX, BPF_ST and BPF_STX), the
For load and store instructions (``BPF_LD``, ``BPF_LDX``, ``BPF_ST``, and ``BPF_STX``), the
8-bit 'opcode' field is divided as:

============  ======  =================
@@ -202,6 +203,18 @@ For load and store instructions (BPF_LD, BPF_LDX, BPF_ST and BPF_STX), the
mode          size    instruction class
============  ======  =================

The mode modifier is one of:

  =============  =====  ====================================  =============
  mode modifier  value  description                           reference
  =============  =====  ====================================  =============
  BPF_IMM        0x00   64-bit immediate instructions         `64-bit immediate instructions`_
  BPF_ABS        0x20   legacy BPF packet access (absolute)   `Legacy BPF Packet access instructions`_
  BPF_IND        0x40   legacy BPF packet access (indirect)   `Legacy BPF Packet access instructions`_
  BPF_MEM        0x60   regular load and store operations     `Regular load and store operations`_
  BPF_ATOMIC     0xc0   atomic operations                     `Atomic operations`_
  =============  =====  ====================================  =============

The size modifier is one of:

  =============  =====  =====================
@@ -213,19 +226,6 @@ The size modifier is one of:
  BPF_DW         0x18   double word (8 bytes)
  =============  =====  =====================

The mode modifier is one of:

  =============  =====  ====================================
  mode modifier  value  description
  =============  =====  ====================================
  BPF_IMM        0x00   64-bit immediate instructions
  BPF_ABS        0x20   legacy BPF packet access (absolute)
  BPF_IND        0x40   legacy BPF packet access (indirect)
  BPF_MEM        0x60   regular load and store operations
  BPF_ATOMIC     0xc0   atomic operations
  =============  =====  ====================================


Regular load and store operations
---------------------------------

@@ -260,9 +260,9 @@ that use the ``BPF_ATOMIC`` mode modifier as follows:
* ``BPF_ATOMIC | BPF_DW | BPF_STX`` for 64-bit operations
* 8-bit and 16-bit wide atomic operations are not supported.

The imm field is used to encode the actual atomic operation.
The 'imm' field is used to encode the actual atomic operation.
Simple atomic operation use a subset of the values defined to encode
arithmetic operations in the imm field to encode the atomic operation:
arithmetic operations in the 'imm' field to encode the atomic operation:

========  =====  ===========
imm       value  description
@@ -274,16 +274,14 @@ arithmetic operations in the imm field to encode the atomic operation:
========  =====  ===========


``BPF_ATOMIC | BPF_W  | BPF_STX`` with imm = BPF_ADD means::
``BPF_ATOMIC | BPF_W  | BPF_STX`` with 'imm' = BPF_ADD means::

  *(u32 *)(dst_reg + off16) += src_reg

``BPF_ATOMIC | BPF_DW | BPF_STX`` with imm = BPF ADD means::
``BPF_ATOMIC | BPF_DW | BPF_STX`` with 'imm' = BPF ADD means::

  *(u64 *)(dst_reg + off16) += src_reg

``BPF_XADD`` is a deprecated name for ``BPF_ATOMIC | BPF_ADD``.

In addition to the simple atomic operations, there also is a modifier and
two complex atomic operations:

@@ -309,16 +307,10 @@ The ``BPF_CMPXCHG`` operation atomically compares the value addressed by
value that was at ``dst_reg + off`` before the operation is zero-extended
and loaded back to ``R0``.

Clang can generate atomic instructions by default when ``-mcpu=v3`` is
enabled. If a lower version for ``-mcpu`` is set, the only atomic instruction
Clang can generate is ``BPF_ADD`` *without* ``BPF_FETCH``. If you need to enable
the atomics features, while keeping a lower ``-mcpu`` version, you can use
``-Xclang -target-feature -Xclang +alu32``.

64-bit immediate instructions
-----------------------------

Instructions with the ``BPF_IMM`` mode modifier use the wide instruction
Instructions with the ``BPF_IMM`` 'mode' modifier use the wide instruction
encoding for an extra imm64 value.

There is currently only one such instruction.
@@ -331,36 +323,6 @@ There is currently only one such instruction.
Legacy BPF Packet access instructions
-------------------------------------

eBPF has special instructions for access to packet data that have been
carried over from classic BPF to retain the performance of legacy socket
filters running in the eBPF interpreter.

The instructions come in two forms: ``BPF_ABS | <size> | BPF_LD`` and
``BPF_IND | <size> | BPF_LD``.

These instructions are used to access packet data and can only be used when
the program context is a pointer to networking packet.  ``BPF_ABS``
accesses packet data at an absolute offset specified by the immediate data
and ``BPF_IND`` access packet data at an offset that includes the value of
a register in addition to the immediate data.

These instructions have seven implicit operands:

 * Register R6 is an implicit input that must contain pointer to a
   struct sk_buff.
 * Register R0 is an implicit output which contains the data fetched from
   the packet.
 * Registers R1-R5 are scratch registers that are clobbered after a call to
   ``BPF_ABS | BPF_LD`` or ``BPF_IND | BPF_LD`` instructions.

These instructions have an implicit program exit condition as well. When an
eBPF program is trying to access the data beyond the packet boundary, the
program execution will be aborted.

``BPF_ABS | BPF_W | BPF_LD`` means::

  R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + imm32))

``BPF_IND | BPF_W | BPF_LD`` means::

  R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
eBPF previously introduced special instructions for access to packet data that were
carried over from classic BPF. However, these instructions are
deprecated and should no longer be used.
Loading