Commit 4c0c6e4c authored by Jakub Kicinski

Merge branch 'docs-document-some-aspects-of-struct-sk_buff'

Jakub Kicinski says:

====================
docs: document some aspects of struct sk_buff

This small set creates a place to render sk_buff documentation,
documents one random thing (data-only skbs) and converts the big
checksum comment to kdoc.
====================

Link: https://lore.kernel.org/r/20220323233715.2104106-1-kuba@kernel.org/
Link: https://lore.kernel.org/r/20220324231312.2241166-1-kuba@kernel.org/
Link: https://lore.kernel.org/r/20220509160456.1058940-1-kuba@kernel.org


Signed-off-by: Jakub Kicinski <kuba@kernel.org>
parents be76955d 9facd941
+1 −0
@@ -97,6 +97,7 @@ Contents:
   sctp
   secid
   seg6-sysctl
   skbuff
   smc-sysctl
   statistics
   strparser
+37 −0
.. SPDX-License-Identifier: GPL-2.0

struct sk_buff
==============

:c:type:`sk_buff` is the main networking structure representing
a packet.

Basic sk_buff geometry
----------------------

.. kernel-doc:: include/linux/skbuff.h
   :doc: Basic sk_buff geometry

Shared skbs and skb clones
--------------------------

:c:member:`sk_buff.users` is a simple refcount allowing multiple entities
to keep a struct sk_buff alive. skbs with a ``sk_buff.users != 1`` are referred
to as shared skbs (see skb_shared()).

skb_clone() allows for fast duplication of skbs. None of the data buffers
get copied, but caller gets a new metadata struct (struct sk_buff).
&skb_shared_info.refcount indicates the number of skbs pointing at the same
packet data (i.e. clones).

dataref and headerless skbs
---------------------------

.. kernel-doc:: include/linux/skbuff.h
   :doc: dataref and headerless skbs

Checksum information
--------------------

.. kernel-doc:: include/linux/skbuff.h
   :doc: skb checksums
+194 −107
@@ -43,25 +43,32 @@
#include <linux/netfilter/nf_conntrack_common.h>
#endif

/* The interface for checksum offload between the stack and networking drivers
/**
 * DOC: skb checksums
 *
 * The interface for checksum offload between the stack and networking drivers
 * is as follows...
 *
 * A. IP checksum related features
 * IP checksum related features
 * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 *
 * Drivers advertise checksum offload capabilities in the features of a device.
 * From the stack's point of view these are capabilities offered by the driver.
 * A driver typically only advertises features that it is capable of offloading
 * to its device.
 *
 * The checksum related features are:
 * .. flat-table:: Checksum related device features
 *   :widths: 1 10
 *
 *	NETIF_F_HW_CSUM	- The driver (or its device) is able to compute one
 *   * - %NETIF_F_HW_CSUM
 *     - The driver (or its device) is able to compute one
 *	 IP (one's complement) checksum for any combination
 *	 of protocols or protocol layering. The checksum is
 *	 computed and set in a packet per the CHECKSUM_PARTIAL
 *	 interface (see below).
 *
 *	NETIF_F_IP_CSUM - Driver (device) is only able to checksum plain
 *   * - %NETIF_F_IP_CSUM
 *     - Driver (device) is only able to checksum plain
 *	 TCP or UDP packets over IPv4. These are specifically
 *	 unencapsulated packets of the form IPv4|TCP or
 *	 IPv4|UDP where the Protocol field in the IPv4 header
@@ -70,7 +77,8 @@
 *	 with NETIF_F_HW_CSUM also set. This feature is being
 *	 DEPRECATED (see below).
 *
 *	NETIF_F_IPV6_CSUM - Driver (device) is only able to checksum plain
 *   * - %NETIF_F_IPV6_CSUM
 *     - Driver (device) is only able to checksum plain
 *	 TCP or UDP packets over IPv6. These are specifically
 *	 unencapsulated packets of the form IPv6|TCP or
 *	 IPv6|UDP where the Next Header field in the IPv6
@@ -80,61 +88,67 @@
 *	 NETIF_F_HW_CSUM also set. This feature is being
 *	 DEPRECATED (see below).
 *
 *	NETIF_F_RXCSUM - Driver (device) performs receive checksum offload.
 *   * - %NETIF_F_RXCSUM
 *     - Driver (device) performs receive checksum offload.
 *	 This flag is only used to disable the RX checksum
 *	 feature for a device. The stack will accept receive
 *	 checksum indication in packets received on a device
 *	 regardless of whether NETIF_F_RXCSUM is set.
 *
 * B. Checksumming of received packets by device. Indication of checksum
 *    verification is set in skb->ip_summed. Possible values are:
 * Checksumming of received packets by device
 * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 *
 * Indication of checksum verification is set in &sk_buff.ip_summed.
 * Possible values are:
 *
 * CHECKSUM_NONE:
 * - %CHECKSUM_NONE
 *
 *   Device did not checksum this packet e.g. due to lack of capabilities.
 *   The packet contains full (though not verified) checksum in packet but
 *   not in skb->csum. Thus, skb->csum is undefined in this case.
 *
 * CHECKSUM_UNNECESSARY:
 * - %CHECKSUM_UNNECESSARY
 *
 *   The hardware you're dealing with doesn't calculate the full checksum
 *   (as in CHECKSUM_COMPLETE), but it does parse headers and verify checksums
 *   for specific protocols. For such packets it will set CHECKSUM_UNNECESSARY
 *   if their checksums are okay. skb->csum is still undefined in this case
 *   (as in %CHECKSUM_COMPLETE), but it does parse headers and verify checksums
 *   for specific protocols. For such packets it will set %CHECKSUM_UNNECESSARY
 *   if their checksums are okay. &sk_buff.csum is still undefined in this case
 *   though. A driver or device must never modify the checksum field in the
 *   packet even if checksum is verified.
 *
 *   CHECKSUM_UNNECESSARY is applicable to following protocols:
 *     TCP: IPv6 and IPv4.
 *     UDP: IPv4 and IPv6. A device may apply CHECKSUM_UNNECESSARY to a
 *   %CHECKSUM_UNNECESSARY is applicable to following protocols:
 *
 *     - TCP: IPv6 and IPv4.
 *     - UDP: IPv4 and IPv6. A device may apply CHECKSUM_UNNECESSARY to a
 *       zero UDP checksum for either IPv4 or IPv6, the networking stack
 *       may perform further validation in this case.
 *     GRE: only if the checksum is present in the header.
 *     SCTP: indicates the CRC in SCTP header has been validated.
 *     FCOE: indicates the CRC in FC frame has been validated.
 *     - GRE: only if the checksum is present in the header.
 *     - SCTP: indicates the CRC in SCTP header has been validated.
 *     - FCOE: indicates the CRC in FC frame has been validated.
 *
 *   skb->csum_level indicates the number of consecutive checksums found in
 *   the packet minus one that have been verified as CHECKSUM_UNNECESSARY.
 *   &sk_buff.csum_level indicates the number of consecutive checksums found in
 *   the packet minus one that have been verified as %CHECKSUM_UNNECESSARY.
 *   For instance if a device receives an IPv6->UDP->GRE->IPv4->TCP packet
 *   and a device is able to verify the checksums for UDP (possibly zero),
 *   GRE (checksum flag is set) and TCP, skb->csum_level would be set to
 *   GRE (checksum flag is set) and TCP, &sk_buff.csum_level would be set to
 *   two. If the device were only able to verify the UDP checksum and not
 *   GRE, either because it doesn't support GRE checksum or because GRE
 *   checksum is bad, skb->csum_level would be set to zero (TCP checksum is
 *   not considered in this case).
 *
 * CHECKSUM_COMPLETE:
 * - %CHECKSUM_COMPLETE
 *
 *   This is the most generic way. The device supplied checksum of the _whole_
 *   packet as seen by netif_rx() and fills in skb->csum. This means the
 *   packet as seen by netif_rx() and fills in &sk_buff.csum. This means the
 *   hardware doesn't need to parse L3/L4 headers to implement this.
 *
 *   Notes:
 *
 *   - Even if device supports only some protocols, but is able to produce
 *     skb->csum, it MUST use CHECKSUM_COMPLETE, not CHECKSUM_UNNECESSARY.
 *   - CHECKSUM_COMPLETE is not applicable to SCTP and FCoE protocols.
 *
 * CHECKSUM_PARTIAL:
 * - %CHECKSUM_PARTIAL
 *
 *   A checksum is set up to be offloaded to a device as described in the
 *   output description for CHECKSUM_PARTIAL. This may occur on a packet
@@ -146,14 +160,18 @@
 *   packet that are after the checksum being offloaded are not considered to
 *   be verified.
 *
 * C. Checksumming on transmit for non-GSO. The stack requests checksum offload
 *    in the skb->ip_summed for a packet. Values are:
 * Checksumming on transmit for non-GSO
 * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 *
 * The stack requests checksum offload in the &sk_buff.ip_summed for a packet.
 * Values are:
 *
 * CHECKSUM_PARTIAL:
 * - %CHECKSUM_PARTIAL
 *
 *   The driver is required to checksum the packet as seen by hard_start_xmit()
 *   from skb->csum_start up to the end, and to record/write the checksum at
 *   offset skb->csum_start + skb->csum_offset. A driver may verify that the
 *   from &sk_buff.csum_start up to the end, and to record/write the checksum at
 *   offset &sk_buff.csum_start + &sk_buff.csum_offset.
 *   A driver may verify that the
 *   csum_start and csum_offset values are valid values given the length and
 *   offset of the packet, but it should not attempt to validate that the
 *   checksum refers to a legitimate transport layer checksum -- it is the
@@ -165,55 +183,66 @@
 *   checksum calculation to the device, or call skb_checksum_help (in the case
 *   that the device does not support offload for a particular checksum).
 *
 *   NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM are being deprecated in favor of
 *   NETIF_F_HW_CSUM. New devices should use NETIF_F_HW_CSUM to indicate
 *   %NETIF_F_IP_CSUM and %NETIF_F_IPV6_CSUM are being deprecated in favor of
 *   %NETIF_F_HW_CSUM. New devices should use %NETIF_F_HW_CSUM to indicate
 *   checksum offload capability.
 *   skb_csum_hwoffload_help() can be called to resolve CHECKSUM_PARTIAL based
 *   skb_csum_hwoffload_help() can be called to resolve %CHECKSUM_PARTIAL based
 *   on network device checksumming capabilities: if a packet does not match
 *   them, skb_checksum_help or skb_crc32c_help (depending on the value of
 *   csum_not_inet, see item D.) is called to resolve the checksum.
 *   them, skb_checksum_help() or skb_crc32c_help() (depending on the value of
 *   &sk_buff.csum_not_inet, see :ref:`crc`)
 *   is called to resolve the checksum.
 *
 * CHECKSUM_NONE:
 * - %CHECKSUM_NONE
 *
 *   The skb was already checksummed by the protocol, or a checksum is not
 *   required.
 *
 * CHECKSUM_UNNECESSARY:
 * - %CHECKSUM_UNNECESSARY
 *
 *   This has the same meaning as CHECKSUM_NONE for checksum offload on
 *   output.
 *
 * CHECKSUM_COMPLETE:
 * - %CHECKSUM_COMPLETE
 *
 *   Not used in checksum output. If a driver observes a packet with this value
 *   set in skbuff, it should treat the packet as if CHECKSUM_NONE were set.
 *   set in skbuff, it should treat the packet as if %CHECKSUM_NONE were set.
 *
 * .. _crc:
 *
 * Non-IP checksum (CRC) offloads
 * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 *
 * D. Non-IP checksum (CRC) offloads
 * .. flat-table::
 *   :widths: 1 10
 *
 *   NETIF_F_SCTP_CRC - This feature indicates that a device is capable of
 *   * - %NETIF_F_SCTP_CRC
 *     - This feature indicates that a device is capable of
 *	 offloading the SCTP CRC in a packet. To perform this offload the stack
 *	 will set csum_start and csum_offset accordingly, set ip_summed to
 *     CHECKSUM_PARTIAL and set csum_not_inet to 1, to provide an indication in
 *     the skbuff that the CHECKSUM_PARTIAL refers to CRC32c.
 *	 %CHECKSUM_PARTIAL and set csum_not_inet to 1, to provide an indication
 *	 in the skbuff that the %CHECKSUM_PARTIAL refers to CRC32c.
 *	 A driver that supports both IP checksum offload and SCTP CRC32c offload
 *	 must verify which offload is configured for a packet by testing the
 *     value of skb->csum_not_inet; skb_crc32c_csum_help is provided to resolve
 *     CHECKSUM_PARTIAL on skbs where csum_not_inet is set to 1.
 *	 value of &sk_buff.csum_not_inet; skb_crc32c_csum_help() is provided to
 *	 resolve %CHECKSUM_PARTIAL on skbs where csum_not_inet is set to 1.
 *
 *   NETIF_F_FCOE_CRC - This feature indicates that a device is capable of
 *     offloading the FCOE CRC in a packet. To perform this offload the stack
 *     will set ip_summed to CHECKSUM_PARTIAL and set csum_start and csum_offset
 *   * - %NETIF_F_FCOE_CRC
 *     - This feature indicates that a device is capable of offloading the FCOE
 *	 CRC in a packet. To perform this offload the stack will set ip_summed
 *	 to %CHECKSUM_PARTIAL and set csum_start and csum_offset
 *	 accordingly. Note that there is no indication in the skbuff that the
 *     CHECKSUM_PARTIAL refers to an FCOE checksum, so a driver that supports
 *	 %CHECKSUM_PARTIAL refers to an FCOE checksum, so a driver that supports
 *	 both IP checksum offload and FCOE CRC offload must verify which offload
 *	 is configured for a packet, presumably by inspecting packet headers.
 *
 * E. Checksumming on output with GSO.
 * Checksumming on output with GSO
 * ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 *
 * In the case of a GSO packet (skb_is_gso(skb) is true), checksum offload
 * In the case of a GSO packet (skb_is_gso() is true), checksum offload
 * is implied by the SKB_GSO_* flags in gso_type. Most obviously, if the
 * gso_type is SKB_GSO_TCPV4 or SKB_GSO_TCPV6, TCP checksum offload as
 * gso_type is %SKB_GSO_TCPV4 or %SKB_GSO_TCPV6, TCP checksum offload as
 * part of the GSO operation is implied. If a checksum is being offloaded
 * with GSO then ip_summed is CHECKSUM_PARTIAL, and both csum_start and
 * with GSO then ip_summed is %CHECKSUM_PARTIAL, and both csum_start and
 * csum_offset are set to refer to the outermost checksum being offloaded
 * (two offloaded checksums are possible with UDP encapsulation).
 */
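
Illustration only, not part of this commit: a minimal userspace sketch of what resolving %CHECKSUM_PARTIAL in software amounts to, following the description in the comment above (checksum from csum_start to the end of the packet, result written at csum_start + csum_offset). The names `pkt_model`, `ones_sum` and `resolve_partial` are hypothetical, and the model assumes csum_start is an offset from the first packet byte, which simplifies how the kernel stores it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the few sk_buff fields involved. */
struct pkt_model {
	uint8_t *data;		/* first byte of the packet */
	size_t len;		/* total packet length in bytes */
	size_t csum_start;	/* where checksumming begins (assumed relative to data) */
	size_t csum_offset;	/* checksum field offset, relative to csum_start */
};

/* Folded 16-bit one's complement sum over buf[0..len), big-endian words. */
static uint16_t ones_sum(const uint8_t *buf, size_t len)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;	/* odd trailing byte, zero padded */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);	/* end-around carry */
	return (uint16_t)sum;
}

/*
 * Checksum everything from csum_start to the end of the packet and store
 * the complemented result at csum_start + csum_offset. Whatever the stack
 * left in the checksum field is included in the sum.
 * Returns 0 on success, -1 if the checksum field does not fit in the packet.
 */
static int resolve_partial(struct pkt_model *p)
{
	size_t field = p->csum_start + p->csum_offset;
	uint16_t csum;

	if (field + 2 > p->len)
		return -1;

	csum = (uint16_t)~ones_sum(p->data + p->csum_start,
				   p->len - p->csum_start);
	p->data[field] = (uint8_t)(csum >> 8);
	p->data[field + 1] = (uint8_t)(csum & 0xff);
	return 0;
}
```

After a successful call, summing the same range again with `ones_sum()` yields 0xffff, which is the usual receiver-side validity check for one's complement checksums. The bounds check mirrors the note that a driver may validate csum_start and csum_offset against the packet length without interpreting the protocol.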
@@ -727,16 +756,32 @@ struct skb_shared_info {
	skb_frag_t	frags[MAX_SKB_FRAGS];
};

/* We divide dataref into two halves.  The higher 16 bits hold references
 * to the payload part of skb->data.  The lower 16 bits hold references to
 * the entire skb->data.  A clone of a headerless skb holds the length of
 * the header in skb->hdr_len.
 *
 * All users must obey the rule that the skb->data reference count must be
 * greater than or equal to the payload reference count.
 *
 * Holding a reference to the payload part means that the user does not
 * care about modifications to the header part of skb->data.
/**
 * DOC: dataref and headerless skbs
 *
 * Transport layers send out clones of payload skbs they hold for
 * retransmissions. To allow lower layers of the stack to prepend their headers
 * we split &skb_shared_info.dataref into two halves.
 * The lower 16 bits count the overall number of references.
 * The higher 16 bits indicate how many of the references are payload-only.
 * skb_header_cloned() checks if skb is allowed to add / write the headers.
 *
 * The creator of the skb (e.g. TCP) marks its skb as &sk_buff.nohdr
 * (via __skb_header_release()). Any clone created from marked skb will get
 * &sk_buff.hdr_len populated with the available headroom.
 * If it is the only clone in existence, it is able to modify the headroom
 * at will. The sequence of calls inside the transport layer is::
 *
 *  <alloc skb>
 *  skb_reserve()
 *  __skb_header_release()
 *  skb_clone()
 *  // send the clone down the stack
 *
 * This is not a very generic construct and it depends on the transport layers
 * doing the right thing. In practice there's usually only one payload-only skb.
 * Having multiple payload-only skbs with different lengths of hdr_len is not
 * possible. The payload-only skbs should never leave their owner.
 */
#define SKB_DATAREF_SHIFT 16
#define SKB_DATAREF_MASK ((1 << SKB_DATAREF_SHIFT) - 1)
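
Illustration only, not part of this commit: a small userspace model of the dataref arithmetic described in the comment above. The helper names are hypothetical. The model assumes that releasing the header sets the counter to one overall reference plus one payload-only reference, and that a clone adds one overall reference; it ignores every other piece of skb state (for example the cloned flag).

```c
#include <stdbool.h>
#include <stdint.h>

#define SKB_DATAREF_SHIFT 16
#define SKB_DATAREF_MASK ((1 << SKB_DATAREF_SHIFT) - 1)

/* Lower 16 bits: overall number of references to the data. */
static unsigned int dataref_total(uint32_t dataref)
{
	return dataref & SKB_DATAREF_MASK;
}

/* Upper 16 bits: how many of those references are payload-only. */
static unsigned int dataref_payload_only(uint32_t dataref)
{
	return dataref >> SKB_DATAREF_SHIFT;
}

/*
 * The header area is treated as shared unless exactly one reference
 * cares about the headers (total minus payload-only equals one).
 */
static bool header_is_shared(uint32_t dataref)
{
	return dataref_total(dataref) - dataref_payload_only(dataref) != 1;
}

/* Assumed effect of the owner declaring its reference payload-only. */
static void model_header_release(uint32_t *dataref)
{
	*dataref = 1 + (1u << SKB_DATAREF_SHIFT);
}

/* Assumed effect of cloning: one more overall reference. */
static void model_clone(uint32_t *dataref)
{
	(*dataref)++;
}
```

Walking the documented sequence through the model: a fresh buffer starts at 1, the release step makes it 0x10001, and the first clone makes it 0x10002, so one reference still owns the headers and may write them. A second clone (0x10003) leaves two header-interested references, and the header area counts as shared.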
@@ -800,6 +845,46 @@ typedef unsigned int sk_buff_data_t;
typedef unsigned char *sk_buff_data_t;
#endif

/**
 * DOC: Basic sk_buff geometry
 *
 * struct sk_buff itself is a metadata structure and does not hold any packet
 * data. All the data is held in associated buffers.
 *
 * &sk_buff.head points to the main "head" buffer. The head buffer is divided
 * into two parts:
 *
 *  - data buffer, containing headers and sometimes payload;
 *    this is the part of the skb operated on by the common helpers
 *    such as skb_put() or skb_pull();
 *  - shared info (struct skb_shared_info) which holds an array of pointers
 *    to read-only data in the (page, offset, length) format.
 *
 * Optionally &skb_shared_info.frag_list may point to another skb.
 *
 * Basic diagram may look like this::
 *
 *                                  ---------------
 *                                 | sk_buff       |
 *                                  ---------------
 *     ,---------------------------  + head
 *    /          ,-----------------  + data
 *   /          /      ,-----------  + tail
 *  |          |      |            , + end
 *  |          |      |           |
 *  v          v      v           v
 *   -----------------------------------------------
 *  | headroom | data |  tailroom | skb_shared_info |
 *   -----------------------------------------------
 *                                 + [page frag]
 *                                 + [page frag]
 *                                 + [page frag]
 *                                 + [page frag]       ---------
 *                                 + frag_list    --> | sk_buff |
 *                                                     ---------
 *
 */
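
Illustration only, not part of this commit: a userspace sketch of the linear part of the geometry drawn above, tracking head, data, tail and end as plain offsets. The `buf_model` type and the `model_*` helpers are hypothetical stand-ins named after the operations they imitate. Page frags and frag_list are left out, and where this model returns -1 on an out-of-room request, the real helpers behave differently.

```c
#include <stddef.h>

/* Offsets into one head buffer; head is always 0 in this model. */
struct buf_model {
	size_t head;
	size_t data;
	size_t tail;
	size_t end;	/* skb_shared_info would start here */
};

static size_t model_headroom(const struct buf_model *b)
{
	return b->data - b->head;
}

static size_t model_tailroom(const struct buf_model *b)
{
	return b->end - b->tail;
}

static size_t model_len(const struct buf_model *b)
{
	return b->tail - b->data;
}

/* Reserve headroom in an empty buffer: move data and tail forward together. */
static int model_reserve(struct buf_model *b, size_t n)
{
	if (model_len(b) != 0 || n > model_tailroom(b))
		return -1;
	b->data += n;
	b->tail += n;
	return 0;
}

/* Append n bytes at the tail (grows data into the tailroom). */
static int model_put(struct buf_model *b, size_t n)
{
	if (n > model_tailroom(b))
		return -1;
	b->tail += n;
	return 0;
}

/* Prepend n bytes at the front (grows data into the headroom). */
static int model_push(struct buf_model *b, size_t n)
{
	if (n > model_headroom(b))
		return -1;
	b->data -= n;
	return 0;
}

/* Strip n bytes from the front (the stripped bytes become headroom). */
static int model_pull(struct buf_model *b, size_t n)
{
	if (n > model_len(b))
		return -1;
	b->data += n;
	return 0;
}
```

For a 128-byte buffer, reserving 32 bytes and appending 40 leaves 32 bytes of headroom and 56 of tailroom. Prepending a 14-byte header then shrinks the headroom to 18 and grows the data length to 54.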

/**
 *	struct sk_buff - socket buffer
 *	@next: Next buffer in list
@@ -1987,8 +2072,10 @@ static inline int skb_header_unclone(struct sk_buff *skb, gfp_t pri)
}

/**
 *	__skb_header_release - release reference to header
 * __skb_header_release() - allow clones to use the headroom
 * @skb: buffer to operate on
 *
 * See "DOC: dataref and headerless skbs".
 */
static inline void __skb_header_release(struct sk_buff *skb)
{