Commit 51a55d4e authored by openeuler-ci-bot, committed by Gitee

!14127 MPTCP Upstream part 24

Merge Pull Request from: @xlldkj

This PR synchronizes patches from upstream MPTCP (Multipath TCP) so that MPTCP functionality in openEuler stays consistent with upstream Linux, and so that future features can be introduced without merge conflicts.

Link: https://gitee.com/openeuler/kernel/pulls/14127

Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Zhang Peng <zhangpeng362@huawei.com>
parents 018e26e4 248fef78
+1 −0
@@ -72,6 +72,7 @@ Contents:
   mac80211-injection
   mctp
   mpls-sysctl
   mptcp
   mptcp-sysctl
   multiqueue
   napi
+37 −33
@@ -7,14 +7,6 @@ MPTCP Sysfs variables
/proc/sys/net/mptcp/* Variables
===============================

enabled - BOOLEAN
	Control whether MPTCP sockets can be created.

	MPTCP sockets can be created if the value is 1. This is a
	per-namespace sysctl.

	Default: 1 (enabled)

add_addr_timeout - INTEGER (seconds)
	Set the timeout after which an ADD_ADDR control message will be
	resent to an MPTCP peer that has not acknowledged a previous
@@ -25,16 +17,22 @@ add_addr_timeout - INTEGER (seconds)

	Default: 120

close_timeout - INTEGER (seconds)
	Set the make-after-break timeout: in absence of any close or
	shutdown syscall, MPTCP sockets will maintain the status
	unchanged for such time, after the last subflow removal, before
	moving to TCP_CLOSE.
allow_join_initial_addr_port - BOOLEAN
	Allow peers to send join requests to the IP address and port number used
	by the initial subflow if the value is 1. This controls a flag that is
	sent to the peer at connection time, and whether such join requests are
	accepted or denied.

	The default value matches TCP_TIMEWAIT_LEN. This is a per-namespace
	sysctl.
	Joins to addresses advertised with ADD_ADDR are not affected by this
	value.

	Default: 60
	This is a per-namespace sysctl.

	Default: 1

available_schedulers - STRING
	Shows the scheduler choices that are currently registered. More packet
	schedulers may be available, but not loaded.

checksum_enabled - BOOLEAN
	Control whether DSS checksum can be enabled.
@@ -44,18 +42,24 @@ checksum_enabled - BOOLEAN

	Default: 0

allow_join_initial_addr_port - BOOLEAN
	Allow peers to send join requests to the IP address and port number used
	by the initial subflow if the value is 1. This controls a flag that is
	sent to the peer at connection time, and whether such join requests are
	accepted or denied.
close_timeout - INTEGER (seconds)
	Set the make-after-break timeout: in absence of any close or
	shutdown syscall, MPTCP sockets will maintain the status
	unchanged for such time, after the last subflow removal, before
	moving to TCP_CLOSE.

	Joins to addresses advertised with ADD_ADDR are not affected by this
	value.
	The default value matches TCP_TIMEWAIT_LEN. This is a per-namespace
	sysctl.

	This is a per-namespace sysctl.
	Default: 60

	Default: 1
enabled - BOOLEAN
	Control whether MPTCP sockets can be created.

	MPTCP sockets can be created if the value is 1. This is a
	per-namespace sysctl.

	Default: 1 (enabled)

pm_type - INTEGER
	Set the default path manager type to use for each new MPTCP
@@ -74,6 +78,14 @@ pm_type - INTEGER

	Default: 0

scheduler - STRING
	Select the scheduler of your choice.

	Support for selection of different schedulers. This is a per-namespace
	sysctl.

	Default: "default"

stale_loss_cnt - INTEGER
	The number of MPTCP-level retransmission intervals with no traffic and
	pending outstanding data on a given subflow required to declare it stale.
@@ -85,11 +97,3 @@ stale_loss_cnt - INTEGER
	This is a per-namespace sysctl.

	Default: 4

scheduler - STRING
	Select the scheduler of your choice.

	Support for selection of different schedulers. This is a per-namespace
	sysctl.

	Default: "default"
+156 −0
.. SPDX-License-Identifier: GPL-2.0

=====================
Multipath TCP (MPTCP)
=====================

Introduction
============

Multipath TCP or MPTCP is an extension to the standard TCP and is described in
`RFC 8684 (MPTCPv1) <https://www.rfc-editor.org/rfc/rfc8684.html>`_. It allows a
device to make use of multiple interfaces at once to send and receive TCP
packets over a single MPTCP connection. MPTCP can aggregate the bandwidth of
multiple interfaces or prefer the one with the lowest latency. It also allows a
fail-over if one path is down, and the traffic is seamlessly reinjected on other
paths.

For more details about Multipath TCP in the Linux kernel, please see the
official website: `mptcp.dev <https://www.mptcp.dev>`_.


Use cases
=========

Thanks to MPTCP, being able to use multiple paths in parallel or simultaneously
brings new use-cases, compared to TCP:

- Seamless handovers: switching from one path to another while preserving
  established connections, e.g. to be used in mobility use-cases, like on
  smartphones.
- Best network selection: using the "best" available path depending on some
  conditions, e.g. latency, losses, cost, bandwidth, etc.
- Network aggregation: using multiple paths at the same time to have a higher
  throughput, e.g. to combine fixed and mobile networks to send files faster.


Concepts
========

Technically, when a new socket is created with the ``IPPROTO_MPTCP`` protocol
(Linux-specific), a *subflow* (or *path*) is created. This *subflow* consists of
a regular TCP connection that is used to transmit data through one interface.
Additional *subflows* can be negotiated later between the hosts. For the remote
host to be able to detect the use of MPTCP, a new field is added to the TCP
*option* field of the underlying TCP *subflow*. This field contains, amongst
other things, a ``MP_CAPABLE`` option that tells the other host to use MPTCP if
it is supported. If the remote host or any middlebox in between does not support
it, the returned ``SYN+ACK`` packet will not contain MPTCP options in the TCP
*option* field. In that case, the connection will be "downgraded" to plain TCP,
and it will continue with a single path.

This behavior is made possible by two internal components: the path manager, and
the packet scheduler.

Path Manager
------------

The Path Manager is in charge of *subflows*, from creation to deletion, and also
address announcements. Typically, it is the client side that initiates subflows,
and the server side that announces additional addresses via the ``ADD_ADDR`` and
``REMOVE_ADDR`` options.

Path managers are controlled by the ``net.mptcp.pm_type`` sysctl knob -- see
mptcp-sysctl.rst. There are two types: the in-kernel one (type ``0``), where the
same rules are applied to all connections (see ``ip mptcp``); and the
userspace one (type ``1``), controlled by a userspace daemon (e.g. `mptcpd
<https://mptcpd.mptcp.dev/>`_), where different rules can be applied to each
connection. The path managers can be controlled via a Netlink API; see
netlink_spec/mptcp_pm.rst.

To be able to use multiple IP addresses on a host to create multiple *subflows*
(paths), the default in-kernel MPTCP path-manager needs to know which IP
addresses can be used. This can be configured with ``ip mptcp endpoint`` for
example.

Packet Scheduler
----------------

The Packet Scheduler is in charge of selecting which available *subflow(s)* to
use to send the next data packet. It can decide to maximize the use of the
available bandwidth, to pick only the path with the lowest latency, or to apply
any other policy depending on the configuration.

Packet schedulers are controlled by the ``net.mptcp.scheduler`` sysctl knob --
see mptcp-sysctl.rst.
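
The knobs mentioned above (``pm_type``, ``scheduler``) can also be read from a
program. The following is a minimal sketch, assuming the knobs are exposed
under ``/proc/sys/net/mptcp/`` as described in mptcp-sysctl.rst; the helper name
``mptcp_sysctl_read()`` is hypothetical and not part of any kernel or libc API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Directory named in mptcp-sysctl.rst for the MPTCP sysctl knobs. */
#define MPTCP_SYSCTL_DIR "/proc/sys/net/mptcp/"

/*
 * Hypothetical helper: read one knob (e.g. "scheduler" or "pm_type") as a
 * string, without the trailing newline. Returns 0 on success, -1 if the
 * file cannot be opened or read (for instance on a kernel without MPTCP).
 */
int mptcp_sysctl_read(const char *knob, char *out, size_t out_len)
{
	char path[128];
	FILE *f;
	int n;

	assert(knob != NULL && out != NULL && out_len > 0);

	n = snprintf(path, sizeof(path), MPTCP_SYSCTL_DIR "%s", knob);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1; /* knob name too long for the path buffer */

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (!fgets(out, (int)out_len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* Strip the newline the kernel appends to the value. */
	out[strcspn(out, "\n")] = '\0';
	return 0;
}
```

A caller could, for example, pass ``"scheduler"`` and a 64-byte buffer, and
treat a ``-1`` return as "MPTCP sysctls not available in this namespace".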


Sockets API
===========

Creating MPTCP sockets
----------------------

On Linux, MPTCP can be used by selecting MPTCP instead of TCP when creating the
``socket``:

.. code-block:: C

    int sd = socket(AF_INET(6), SOCK_STREAM, IPPROTO_MPTCP);

Note that ``IPPROTO_MPTCP`` is defined as ``262``.

If MPTCP is not supported, ``errno`` will be set to:

- ``EINVAL`` (*Invalid argument*): MPTCP is not available, on kernels < v5.6.
- ``EPROTONOSUPPORT`` (*Protocol not supported*): MPTCP has not been compiled,
  on kernels >= v5.6.
- ``ENOPROTOOPT`` (*Protocol not available*): MPTCP has been disabled using
  ``net.mptcp.enabled`` sysctl knob; see mptcp-sysctl.rst.
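
The error cases above suggest a simple fallback strategy for applications that
want MPTCP when it is available and plain TCP otherwise. The following is a
minimal sketch under that assumption; ``mptcp_errno_reason()`` and
``socket_mptcp_or_tcp()`` are hypothetical helper names, and the fallback policy
itself is an illustration, not something this document prescribes.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262 /* value stated in the text above */
#endif

/* Map the errno values listed above to a short reason string. */
const char *mptcp_errno_reason(int err)
{
	switch (err) {
	case EINVAL:
		return "MPTCP not available (kernel < 5.6)";
	case EPROTONOSUPPORT:
		return "MPTCP not compiled in (kernel >= 5.6)";
	case ENOPROTOOPT:
		return "MPTCP disabled via net.mptcp.enabled";
	default:
		return "unrelated error";
	}
}

/*
 * Hypothetical helper: try MPTCP first, and fall back to plain TCP only
 * when the failure is one of the "not supported" cases listed above.
 * Returns the descriptor (or -1) and stores 1 in *is_mptcp if the MPTCP
 * socket was created, 0 otherwise.
 */
int socket_mptcp_or_tcp(int family, int *is_mptcp)
{
	int fd;

	assert(is_mptcp != NULL);

	fd = socket(family, SOCK_STREAM, IPPROTO_MPTCP);
	*is_mptcp = fd >= 0;
	if (fd >= 0)
		return fd;

	if (errno != EINVAL && errno != EPROTONOSUPPORT &&
	    errno != ENOPROTOOPT)
		return -1; /* e.g. EMFILE: falling back would not help */

	fprintf(stderr, "falling back to TCP: %s\n",
		mptcp_errno_reason(errno));
	return socket(family, SOCK_STREAM, IPPROTO_TCP);
}
```

Restricting the fallback to the three documented ``errno`` values keeps
unrelated failures, such as running out of file descriptors, visible to the
caller instead of hiding them behind a second ``socket()`` call.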

MPTCP is then opt-in: applications need to explicitly request it. Note that
applications can be forced to use MPTCP with different techniques, e.g.
``LD_PRELOAD`` (see ``mptcpize``), eBPF (see ``mptcpify``), SystemTAP,
``GODEBUG`` (``GODEBUG=multipathtcp=1``), etc.

Switching to ``IPPROTO_MPTCP`` instead of ``IPPROTO_TCP`` should be as
transparent as possible for the userspace applications.

Socket options
--------------

MPTCP supports most socket options handled by TCP. It is possible some less
common options are not supported, but contributions are welcome.

Generally, the same value is propagated to all subflows, including the ones
created after the calls to ``setsockopt()``. eBPF can be used to set different
values per subflow.

There are some MPTCP specific socket options at the ``SOL_MPTCP`` (284) level to
retrieve info. They fill the ``optval`` buffer of the ``getsockopt()`` system
call:

- ``MPTCP_INFO``: Uses ``struct mptcp_info``.
- ``MPTCP_TCPINFO``: Uses ``struct mptcp_subflow_data``, followed by an array of
  ``struct tcp_info``.
- ``MPTCP_SUBFLOW_ADDRS``: Uses ``struct mptcp_subflow_data``, followed by an
  array of ``struct mptcp_subflow_addrs``.
- ``MPTCP_FULL_INFO``: Uses ``struct mptcp_full_info``, with one pointer to an
  array of ``struct mptcp_subflow_info`` (including the
  ``struct mptcp_subflow_addrs``), and one pointer to an array of
  ``struct tcp_info``, followed by the content of ``struct mptcp_info``.
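
As an illustration of the calling convention for these options, here is a
minimal sketch that issues ``getsockopt()`` with ``MPTCP_INFO`` at the
``SOL_MPTCP`` level. To stay self-contained it fills an opaque byte buffer
instead of including ``<linux/mptcp.h>``; real code would normally pass a
``struct mptcp_info``. The helper name ``mptcp_info_raw()`` is hypothetical, and
the numeric value used for ``MPTCP_INFO`` is an assumption taken from the UAPI
header rather than from this document.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_MPTCP
#define SOL_MPTCP 284 /* level stated in the text above */
#endif
#ifndef MPTCP_INFO
#define MPTCP_INFO 1 /* assumption: value from <linux/mptcp.h> */
#endif

/*
 * Hypothetical helper: fetch MPTCP_INFO into a caller-provided buffer.
 * Returns the number of bytes reported back by the kernel, or -1 on
 * error (bad descriptor, non-MPTCP socket, unsupported kernel, ...).
 */
long mptcp_info_raw(int fd, void *buf, size_t buf_len)
{
	socklen_t len = (socklen_t)buf_len;

	assert(buf != NULL);

	memset(buf, 0, buf_len);
	if (getsockopt(fd, SOL_MPTCP, MPTCP_INFO, buf, &len) < 0)
		return -1;

	return (long)len;
}
```

The other options in the list follow the same pattern, with ``optval`` pointing
to the structure (and trailing array, where applicable) named for each option.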

Note that at the TCP level, the ``TCP_IS_MPTCP`` socket option can be used to
check whether MPTCP is currently being used: the value will be set to 1 if it is.


Design choices
==============

A new socket type has been added for MPTCP for the userspace-facing socket. The
kernel is in charge of creating subflow sockets: they are TCP sockets where the
behavior is modified using TCP-ULP.

MPTCP listen sockets will create "plain" *accepted* TCP sockets if the
connection request from the client didn't ask for MPTCP, making the performance
impact minimal when MPTCP is enabled by default.
+1 −1
@@ -15043,7 +15043,7 @@ B: https://github.com/multipath-tcp/mptcp_net-next/issues
T:	git https://github.com/multipath-tcp/mptcp_net-next.git export-net
T:	git https://github.com/multipath-tcp/mptcp_net-next.git export
F:	Documentation/netlink/specs/mptcp_pm.yaml
F:	Documentation/networking/mptcp-sysctl.rst
F:	Documentation/networking/mptcp*.rst
F:	include/net/mptcp.h
F:	include/trace/events/mptcp.h
F:	include/uapi/linux/mptcp*.h
+4 −4
@@ -2055,13 +2055,13 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied)
		do_div(grow, msk->rcvq_space.space);
		rcvwin += (grow << 1);

		rcvbuf = min_t(u64, __tcp_space_from_win(scaling_ratio, rcvwin),
		rcvbuf = min_t(u64, mptcp_space_from_win(sk, rcvwin),
			       READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rmem[2]));

		if (rcvbuf > sk->sk_rcvbuf) {
			u32 window_clamp;

			window_clamp = __tcp_win_from_space(scaling_ratio, rcvbuf);
			window_clamp = mptcp_win_from_space(sk, rcvbuf);
			WRITE_ONCE(sk->sk_rcvbuf, rcvbuf);

			/* Make subflows follow along.  If we do not do this, we
@@ -2217,7 +2217,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
		if (skb_queue_empty(&msk->receive_queue) && __mptcp_move_skbs(msk))
			continue;

		/* only the master socket status is relevant here. The exit
		/* only the MPTCP socket status is relevant here. The exit
		 * conditions mirror closely tcp_recvmsg()
		 */
		if (copied >= target)
@@ -3548,7 +3548,7 @@ void mptcp_subflow_process_delegated(struct sock *ssk, long status)
static int mptcp_hash(struct sock *sk)
{
	/* should never be called,
	 * we hash the TCP subflows not the master socket
	 * we hash the TCP subflows not the MPTCP socket
	 */
	WARN_ON_ONCE(1);
	return 0;