Commit 01e2d157 authored by David S. Miller's avatar David S. Miller
Browse files

Merge branch 'skb-mono-delivery-time'

Martin KaFai Lau says:

====================
Preserve mono delivery time (EDT) in skb->tstamp

skb->tstamp was first used as the (rcv) timestamp.
The major usage is to report it to the user (e.g. SO_TIMESTAMP).

Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
during egress and used by the qdisc (e.g. sch_fq) to make decision on when
the skb can be passed to the dev.

Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
or the delivery_time, so it is always reset to 0 whenever forwarded
between egress and ingress.

While it makes sense to always clear the (rcv) timestamp in skb->tstamp
to avoid confusing sch_fq that expects the delivery_time, it is a
performance issue [0] to clear the delivery_time if the skb finally
egress to a fq@phy-dev.

This set is to keep the mono delivery time and make it available to
the final egress interface.  Please see individual patch for
the details.

[0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf



v6:
- Add kdoc and use non-UAPI type in patch 6 (Jakub)

v5:
netdev:
- Patch 3 in v4 is broken down into smaller patches 3, 4, and 5 in v5
- The mono_delivery_time bit clearing in __skb_tstamp_tx() is
  done in __net_timestamp() instead.  This is patch 4 in v5.
- Missed a skb_clear_delivery_time() for the 'skip_classify' case
  in dev.c in v4.  That is fixed in patch 5 in v5 for correctness.
  The skb_clear_delivery_time() will be moved to a later
  stage in Patch 10, so it was an intermediate error in v4.
- Added delivery time handling for nfnetlink_{log, queue}.c in patch 9 (Daniel)
- Added delivery time handling in the IPv6 IOAM hop-by-hop option which has
  an experimental IANA assigned value 49 in patch 8
- Added delivery time handling in nf_conntrack for the ipv6 defrag case
  in patch 7
- Removed unlikely() from testing skb->mono_delivery_time (Daniel)

bpf:
- Remove the skb->tstamp dance in ingress.  Depends on bpf insn
  rewrite to return 0 if skb->tstamp has delivery time in patch 11.
  It is to backward compatible with the existing tc-bpf@ingress in
  patch 11.
- bpf_set_delivery_time() will also allow dtime == 0 and
  dtime_type == BPF_SKB_DELIVERY_TIME_NONE as argument
  in patch 12.

v4:
netdev:
- Push the skb_clear_delivery_time() from
  ip_local_deliver() and ip6_input() to
  ip_local_deliver_finish() and ip6_input_finish()
  to accommodate the ipvs forward path.
  This is the notable change in v4 at the netdev side.

    - Patch 3/8 first does the skb_clear_delivery_time() after
      sch_handle_ingress() in dev.c and this will make the
      tc-bpf forward path work via the bpf_redirect_*() helper.

    - The next patch 4/8 (new in v4) will then postpone the
      skb_clear_delivery_time() from dev.c to
      the ip_local_deliver_finish() and ip6_input_finish() after
      taking care of the tstamp usage in the ip defrag case.
      This will make the kernel forward path also work, e.g.
      the ip[6]_forward().

- Fixed a case v3 which missed setting the skb->mono_delivery_time bit
  when sending TCP rst/ack in some cases (e.g. from a ctl_sk).
  That case happens at ip_send_unicast_reply() and
  tcp_v6_send_response().  It is fixed in patch 1/8 (and
  then patch 3/8) in v4.

bpf:
- Adding __sk_buff->delivery_time_type instead of adding
  __sk_buff->mono_delivery_time as in v3.  The tc-bpf can stay with
  one __sk_buff->tstamp instead of having two 'time' fields
  while one is 0 and another is not.
  tc-bpf can use the new __sk_buff->delivery_time_type to tell
  what is stored in __sk_buff->tstamp.
- bpf_skb_set_delivery_time() helper is added to set
  __sk_buff->tstamp from non mono delivery_time to
  mono delivery_time
- Most of the convert_ctx_access() bpf insn rewrite in v3
  is gone, so no new rewrite added for __sk_buff->tstamp.
  The only rewrite added is for reading the new
  __sk_buff->delivery_time_type.
- Added selftests, test_tc_dtime.c

v3:
- Feedback from v2 is using shinfo(skb)->tx_flags could be racy.
- Considered to reuse a few bits in skb->tstamp to represent
  different semantics, other than more code churns, it will break
  the bpf usecase which currently can write and then read back
  the skb->tstamp.
- Went back to v1 idea on adding a bit to skb and address the
  feedbacks on v1:
- Added one bit skb->mono_delivery_time to flag that
  the skb->tstamp has the mono delivery_time (EDT), instead
  of adding a bit to flag if the skb->tstamp has been forwarded or not.
- Instead of resetting the delivery_time back to the (rcv) timestamp
  during recvmsg syscall which may be too late and not useful,
  the delivery_time reset in v3 happens earlier once the stack
  knows that the skb will be delivered locally.
- Handled the tapping@ingress case by af_packet
- No need to change the (rcv) timestamp to mono clock base as in v1.
  The added one bit to flag skb->mono_delivery_time is enough
  to keep the EDT delivery_time during forward.
- Added logic to the bpf side to make the existing bpf
  running at ingress can still get the (rcv) timestamp
  when reading the __sk_buff->tstamp.  New __sk_buff->mono_delivery_time
  is also added.  Test is still needed to test this piece.
====================

Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
parents 6fb8661c c803475f
Loading
Loading
Loading
Loading
+1 −1
Original line number Diff line number Diff line
@@ -74,7 +74,7 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb,
	skb_tx_timestamp(skb);

	/* do not fool net_timestamp_check() with various clock bases */
	skb->tstamp = 0;
	skb_clear_tstamp(skb);

	skb_orphan(skb);

+2 −1
Original line number Diff line number Diff line
@@ -572,7 +572,8 @@ struct bpf_prog {
				has_callchain_buf:1, /* callchain buffer allocated? */
				enforce_expected_attach_type:1, /* Enforce expected_attach_type checking at attach time */
				call_get_stack:1, /* Do we call bpf_get_stack() or bpf_get_stackid() */
				call_get_func_ip:1; /* Do we call get_func_ip() */
				call_get_func_ip:1, /* Do we call get_func_ip() */
				delivery_time_access:1; /* Accessed __sk_buff->delivery_time_type */
	enum bpf_prog_type	type;		/* Type of BPF program */
	enum bpf_attach_type	expected_attach_type; /* For some prog types */
	u32			len;		/* Number of filter blocks */
+68 −6
Original line number Diff line number Diff line
@@ -795,6 +795,10 @@ typedef unsigned char *sk_buff_data_t;
 *	@dst_pending_confirm: need to confirm neighbour
 *	@decrypted: Decrypted SKB
 *	@slow_gro: state present at GRO time, slower prepare step required
 *	@mono_delivery_time: When set, skb->tstamp has the
 *		delivery_time in mono clock base (i.e. EDT).  Otherwise, the
 *		skb->tstamp has the (rcv) timestamp at ingress and
 *		delivery_time at egress.
 *	@napi_id: id of the NAPI struct this skb came from
 *	@sender_cpu: (aka @napi_id) source CPU in XPS
 *	@secmark: security marking
@@ -937,8 +941,12 @@ struct sk_buff {
	__u8			vlan_present:1;	/* See PKT_VLAN_PRESENT_BIT */
	__u8			csum_complete_sw:1;
	__u8			csum_level:2;
	__u8			csum_not_inet:1;
	__u8			dst_pending_confirm:1;
	__u8			mono_delivery_time:1;
#ifdef CONFIG_NET_CLS_ACT
	__u8			tc_skip_classify:1;
	__u8			tc_at_ingress:1;
#endif
#ifdef CONFIG_IPV6_NDISC_NODETYPE
	__u8			ndisc_nodetype:2;
#endif
@@ -949,10 +957,6 @@ struct sk_buff {
#ifdef CONFIG_NET_SWITCHDEV
	__u8			offload_fwd_mark:1;
	__u8			offload_l3_fwd_mark:1;
#endif
#ifdef CONFIG_NET_CLS_ACT
	__u8			tc_skip_classify:1;
	__u8			tc_at_ingress:1;
#endif
	__u8			redirected:1;
#ifdef CONFIG_NET_REDIRECT
@@ -965,6 +969,7 @@ struct sk_buff {
	__u8			decrypted:1;
#endif
	__u8			slow_gro:1;
	__u8			csum_not_inet:1;

#ifdef CONFIG_NET_SCHED
	__u16			tc_index;	/* traffic control index */
@@ -1042,10 +1047,16 @@ struct sk_buff {
/* if you move pkt_vlan_present around you also must adapt these constants */
#ifdef __BIG_ENDIAN_BITFIELD
#define PKT_VLAN_PRESENT_BIT	7
#define TC_AT_INGRESS_MASK		(1 << 0)
#define SKB_MONO_DELIVERY_TIME_MASK	(1 << 2)
#else
#define PKT_VLAN_PRESENT_BIT	0
#define TC_AT_INGRESS_MASK		(1 << 7)
#define SKB_MONO_DELIVERY_TIME_MASK	(1 << 5)
#endif
#define PKT_VLAN_PRESENT_OFFSET	offsetof(struct sk_buff, __pkt_vlan_present_offset)
#define TC_AT_INGRESS_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)
#define SKB_MONO_DELIVERY_TIME_OFFSET offsetof(struct sk_buff, __pkt_vlan_present_offset)

#ifdef __KERNEL__
/*
@@ -3976,6 +3987,7 @@ static inline void skb_get_new_timestampns(const struct sk_buff *skb,
static inline void __net_timestamp(struct sk_buff *skb)
{
	skb->tstamp = ktime_get_real();
	skb->mono_delivery_time = 0;
}

static inline ktime_t net_timedelta(ktime_t t)
@@ -3983,6 +3995,56 @@ static inline ktime_t net_timedelta(ktime_t t)
	return ktime_sub(ktime_get_real(), t);
}

static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
					 bool mono)
{
	skb->tstamp = kt;
	skb->mono_delivery_time = kt && mono;
}

DECLARE_STATIC_KEY_FALSE(netstamp_needed_key);

/* It is used in the ingress path to clear the delivery_time.
 * If needed, set the skb->tstamp to the (rcv) timestamp.
 */
static inline void skb_clear_delivery_time(struct sk_buff *skb)
{
	if (skb->mono_delivery_time) {
		skb->mono_delivery_time = 0;
		if (static_branch_unlikely(&netstamp_needed_key))
			skb->tstamp = ktime_get_real();
		else
			skb->tstamp = 0;
	}
}

static inline void skb_clear_tstamp(struct sk_buff *skb)
{
	if (skb->mono_delivery_time)
		return;

	skb->tstamp = 0;
}

static inline ktime_t skb_tstamp(const struct sk_buff *skb)
{
	if (skb->mono_delivery_time)
		return 0;

	return skb->tstamp;
}

static inline ktime_t skb_tstamp_cond(const struct sk_buff *skb, bool cond)
{
	if (!skb->mono_delivery_time && skb->tstamp)
		return skb->tstamp;

	if (static_branch_unlikely(&netstamp_needed_key) || cond)
		return ktime_get_real();

	return 0;
}

static inline u8 skb_metadata_len(const struct sk_buff *skb)
{
	return skb_shinfo(skb)->meta_len;
@@ -4839,7 +4901,7 @@ static inline void skb_set_redirected(struct sk_buff *skb, bool from_ingress)
#ifdef CONFIG_NET_REDIRECT
	skb->from_ingress = from_ingress;
	if (skb->from_ingress)
		skb->tstamp = 0;
		skb_clear_tstamp(skb);
#endif
}

+2 −0
Original line number Diff line number Diff line
@@ -70,6 +70,7 @@ struct frag_v6_compare_key {
 * @stamp: timestamp of the last received fragment
 * @len: total length of the original datagram
 * @meat: length of received fragments so far
 * @mono_delivery_time: stamp has a mono delivery time (EDT)
 * @flags: fragment queue flags
 * @max_size: maximum received fragment size
 * @fqdir: pointer to struct fqdir
@@ -90,6 +91,7 @@ struct inet_frag_queue {
	ktime_t			stamp;
	int			len;
	int			meat;
	u8			mono_delivery_time;
	__u8			flags;
	u16			max_size;
	struct fqdir		*fqdir;
+40 −1
Original line number Diff line number Diff line
@@ -5086,6 +5086,37 @@ union bpf_attr {
 *	Return
 *		0 on success, or a negative error in case of failure. On error
 *		*dst* buffer is zeroed out.
 *
 * long bpf_skb_set_delivery_time(struct sk_buff *skb, u64 dtime, u32 dtime_type)
 *	Description
 *		Set a *dtime* (delivery time) to the __sk_buff->tstamp and also
 *		change the __sk_buff->delivery_time_type to *dtime_type*.
 *
 *		When setting a delivery time (non zero *dtime*) to
 *		__sk_buff->tstamp, only BPF_SKB_DELIVERY_TIME_MONO *dtime_type*
 *		is supported.  It is the only delivery_time_type that will be
 *		kept after bpf_redirect_*().
 *
 *		If there is no need to change the __sk_buff->delivery_time_type,
 *		the delivery time can be directly written to __sk_buff->tstamp
 *		instead.
 *
 *		*dtime* 0 and *dtime_type* BPF_SKB_DELIVERY_TIME_NONE
 *		can be used to clear any delivery time stored in
 *		__sk_buff->tstamp.
 *
 *		Only IPv4 and IPv6 skb->protocol are supported.
 *
 *		This function is most useful when it needs to set a
 *		mono delivery time to __sk_buff->tstamp and then
 *		bpf_redirect_*() to the egress of an iface.  For example,
 *		changing the (rcv) timestamp in __sk_buff->tstamp at
 *		ingress to a mono delivery time and then bpf_redirect_*()
 *		to sch_fq@phy-dev.
 *	Return
 *		0 on success.
 *		**-EINVAL** for invalid input
 *		**-EOPNOTSUPP** for unsupported delivery_time_type and protocol
 */
#define __BPF_FUNC_MAPPER(FN)		\
	FN(unspec),			\
@@ -5280,6 +5311,7 @@ union bpf_attr {
	FN(xdp_load_bytes),		\
	FN(xdp_store_bytes),		\
	FN(copy_from_user_task),	\
	FN(skb_set_delivery_time),      \
	/* */

/* integer value in 'imm' field of BPF_CALL instruction selects which helper
@@ -5469,6 +5501,12 @@ union { \
	__u64 :64;			\
} __attribute__((aligned(8)))

enum {
	BPF_SKB_DELIVERY_TIME_NONE,
	BPF_SKB_DELIVERY_TIME_UNSPEC,
	BPF_SKB_DELIVERY_TIME_MONO,
};

/* user accessible mirror of in-kernel sk_buff.
 * new fields can only be added to the end of this structure
 */
@@ -5509,7 +5547,8 @@ struct __sk_buff {
	__u32 gso_segs;
	__bpf_md_ptr(struct bpf_sock *, sk);
	__u32 gso_size;
	__u32 :32;		/* Padding, future use. */
	__u8  delivery_time_type;
	__u32 :24;		/* Padding, future use. */
	__u64 hwtstamp;
};

Loading