Commit e8b9f0da authored by Paolo Abeni

Merge branch 'dsa-changes-for-multiple-cpu-ports-part-4'

Vladimir Oltean says:

====================
DSA changes for multiple CPU ports (part 4)

Those who have been following part 1:
https://patchwork.kernel.org/project/netdevbpf/cover/20220511095020.562461-1-vladimir.oltean@nxp.com/
part 2:
https://patchwork.kernel.org/project/netdevbpf/cover/20220521213743.2735445-1-vladimir.oltean@nxp.com/
and part 3:
https://patchwork.kernel.org/project/netdevbpf/cover/20220819174820.3585002-1-vladimir.oltean@nxp.com/
will know that I am trying to enable the second internal port pair from
the NXP LS1028A Felix switch for DSA-tagged traffic via "ocelot-8021q".

This series represents the final part of that effort. We have:

- the introduction of new UAPI in the form of IFLA_DSA_MASTER, the
  iproute2 patch for which is here:
  https://patchwork.kernel.org/project/netdevbpf/patch/20220904190025.813574-1-vladimir.oltean@nxp.com/

- preparation for LAG DSA masters in terms of suppressing some
  operations for masters in the DSA core that simply don't make sense
  when those masters are a bonding/team interface

- handling all the net device events that occur between DSA and a
  LAG DSA master, including migration to a different DSA master when the
  current master joins a LAG, or the LAG gets destroyed

- updating documentation

- adding an implementation for NXP LS1028A, where things are insanely
  complicated due to hardware limitations. We have 2 tagging protocols:

  * the native "ocelot" protocol (NPI port mode). This does not support
    CPU ports in a LAG, and supports a single DSA master. The DSA master
    can be changed between eno2 (2.5G) and eno3 (1G), but all ports must
    be down during the changing process, and user ports assigned to the
    old DSA master will refuse to come up if the user requests that
    during a "transient" state.

  * the "ocelot-8021q" software-defined protocol, where the Ethernet
    ports connected to the CPU are not actually "god mode" ports as far
    as the hardware is concerned. So here, static assignment between
    user and CPU ports is possible by editing the PGID_SRC masks for
    the port-based forwarding matrix, and "CPU ports in a LAG" simply
    means "a LAG like any other".

The series was regression-tested on LS1028A using the local_termination.sh
kselftest, in most of the possible operating modes and tagging protocols.
I have not done a detailed performance evaluation yet, but using LAG, it is
possible to exceed the termination bandwidth of a single CPU port in an
iperf3 test with multiple senders and multiple receivers.

v1 at:
https://patchwork.kernel.org/project/netdevbpf/cover/20220830195932.683432-1-vladimir.oltean@nxp.com/

Previous (older) RFC at:
https://lore.kernel.org/netdev/20220523104256.3556016-1-olteanv@gmail.com/
====================

Link: https://lore.kernel.org/r/20220911010706.2137967-1-vladimir.oltean@nxp.com


Signed-off-by: Paolo Abeni <pabeni@redhat.com>
parents 42e53b44 eca70102
+96 −0
@@ -49,6 +49,9 @@ In this documentation the following Ethernet interfaces are used:
*eth0*
  the master interface

*eth1*
  another master interface

*lan1*
  a slave interface

@@ -360,3 +363,96 @@ the ``self`` flag) has been removed. This results in the following changes:

Script writers are therefore encouraged to use the ``master static`` set of
flags when working with bridge FDB entries on DSA switch interfaces.

Affinity of user ports to CPU ports
-----------------------------------

Typically, DSA switches are attached to the host via a single Ethernet
interface, but in cases where the switch chip is discrete, the hardware design
may permit the use of 2 or more ports connected to the host, for an increase in
termination throughput.

DSA can make use of multiple CPU ports in two ways. First, it is possible to
statically assign the termination traffic associated with a certain user port
to be processed by a certain CPU port. This way, user space can implement
custom policies of static load balancing between user ports, by spreading the
affinities according to the available CPU ports.
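
One possible static policy is a plain round-robin spread of user ports over the available masters. The sketch below only formats the iproute2 command that would apply such an affinity; the ``swpN``/``ethN`` naming and both helper names are assumptions made for this example, and other policies are equally valid.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical policy: spread user ports over the masters round-robin. */
static int sketch_pick_master(int user_port, int num_masters)
{
	assert(user_port >= 0 && num_masters > 0);
	return user_port % num_masters;
}

/*
 * Format the iproute2 command that would apply the affinity of one user
 * port. The swpN/ethN interface names are assumptions for illustration.
 * Returns the snprintf() result (length of the formatted command).
 */
static int sketch_format_cmd(char *buf, size_t len, int user_port,
			     int num_masters)
{
	return snprintf(buf, len, "ip link set swp%d type dsa master eth%d",
			user_port, sketch_pick_master(user_port, num_masters));
}
```

Calling ``sketch_format_cmd()`` once per user port yields one ``ip link set`` command per port, which a script could then execute.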

Secondly, it is possible to perform load balancing between CPU ports on a per
packet basis, rather than statically assigning user ports to CPU ports.
This can be achieved by placing the DSA masters under a LAG interface (bonding
or team). DSA monitors this operation and creates a mirror of this software LAG
on the CPU ports facing the physical DSA masters that constitute the LAG slave
devices.

To make use of multiple CPU ports, the firmware (device tree) description of
the switch must mark all the links between CPU ports and their DSA masters
using the ``ethernet`` reference/phandle. At startup, only a single CPU port
and DSA master will be used - the numerically first port from the firmware
description which has an ``ethernet`` property. It is up to the user to
configure the system for the switch to use other masters.

DSA uses the ``rtnl_link_ops`` mechanism (with a "dsa" ``kind``) to allow
changing the DSA master of a user port. The ``IFLA_DSA_MASTER`` u32 netlink
attribute contains the ifindex of the master device that handles each slave
device. The DSA master must be a valid candidate based on firmware node
information, or a LAG interface which contains only slaves which are valid
candidates.
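
As a rough illustration of how this attribute might be carried in a request, the following user-space sketch builds (but does not send) an ``RTM_NEWLINK`` message in which ``IFLA_LINKINFO`` nests a "dsa" ``IFLA_INFO_KIND`` and an ``IFLA_INFO_DATA`` holding the u32 master ifindex. The numeric value behind ``SKETCH_IFLA_DSA_MASTER``, the ``dsa_master_req`` layout and all helper names are assumptions made for this example, not code from the series.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>		/* AF_UNSPEC */
#include <linux/rtnetlink.h>	/* nlmsghdr, ifinfomsg, rtattr, IFLA_* */

/* Assumption: mirrors the value of IFLA_DSA_MASTER in the uapi enum. */
#define SKETCH_IFLA_DSA_MASTER 1

struct dsa_master_req {
	struct nlmsghdr nlh;
	struct ifinfomsg ifi;
	char attrs[128];
};

/* Append one attribute at the tail of the message and return it. */
static struct rtattr *req_add_attr(struct dsa_master_req *req,
				   unsigned short type,
				   const void *data, unsigned short len)
{
	unsigned int off = NLMSG_ALIGN(req->nlh.nlmsg_len);
	struct rtattr *rta = (struct rtattr *)((char *)req + off);

	assert(off + RTA_ALIGN(RTA_LENGTH(len)) <= sizeof(*req));
	rta->rta_type = type;
	rta->rta_len = RTA_LENGTH(len);
	if (len)
		memcpy(RTA_DATA(rta), data, len);
	req->nlh.nlmsg_len = off + RTA_ALIGN(rta->rta_len);
	return rta;
}

/* Close a nested attribute: its length covers everything added since. */
static void req_end_nest(struct dsa_master_req *req, struct rtattr *nest)
{
	nest->rta_len = (unsigned short)((char *)req + req->nlh.nlmsg_len -
					 (char *)nest);
}

/* Build a request asking user port user_ifindex to use master_ifindex. */
static void build_dsa_master_req(struct dsa_master_req *req,
				 int user_ifindex, uint32_t master_ifindex)
{
	struct rtattr *linkinfo, *data;

	memset(req, 0, sizeof(*req));
	req->nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
	req->nlh.nlmsg_type = RTM_NEWLINK;
	req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	req->ifi.ifi_family = AF_UNSPEC;
	req->ifi.ifi_index = user_ifindex;

	linkinfo = req_add_attr(req, IFLA_LINKINFO, NULL, 0);
	req_add_attr(req, IFLA_INFO_KIND, "dsa", sizeof("dsa"));
	data = req_add_attr(req, IFLA_INFO_DATA, NULL, 0);
	req_add_attr(req, SKETCH_IFLA_DSA_MASTER, &master_ifindex,
		     sizeof(master_ifindex));
	req_end_nest(req, data);
	req_end_nest(req, linkinfo);
}
```

A caller would typically obtain both ifindex values with ``if_nametoindex()`` and write the first ``nlh.nlmsg_len`` bytes of the request to a ``NETLINK_ROUTE`` socket; that part is omitted here.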

Using iproute2, the following manipulations are possible:

  .. code-block:: sh

    # See the DSA master in current use
    ip -d link show dev swp0
        (...)
        dsa master eth0

    # Static CPU port distribution
    ip link set swp0 type dsa master eth1
    ip link set swp1 type dsa master eth0
    ip link set swp2 type dsa master eth1
    ip link set swp3 type dsa master eth0

    # CPU ports in LAG, using explicit assignment of the DSA master
    ip link add bond0 type bond mode balance-xor && ip link set bond0 up
    ip link set eth1 down && ip link set eth1 master bond0
    ip link set swp0 type dsa master bond0
    ip link set swp1 type dsa master bond0
    ip link set swp2 type dsa master bond0
    ip link set swp3 type dsa master bond0
    ip link set eth0 down && ip link set eth0 master bond0
    ip -d link show dev swp0
        (...)
        dsa master bond0

    # CPU ports in LAG, relying on implicit migration of the DSA master
    ip link add bond0 type bond mode balance-xor && ip link set bond0 up
    ip link set eth0 down && ip link set eth0 master bond0
    ip link set eth1 down && ip link set eth1 master bond0
    ip -d link show dev swp0
        (...)
        dsa master bond0

Notice that in the case of CPU ports under a LAG, the use of the
``IFLA_DSA_MASTER`` netlink attribute is not strictly needed, but rather, DSA
reacts to the ``IFLA_MASTER`` attribute change of its present master (``eth0``)
and migrates all user ports to the new upper of ``eth0``, ``bond0``. Similarly,
when ``bond0`` is destroyed using ``RTM_DELLINK``, DSA migrates the user ports
that were assigned to this interface to the first physical DSA master which is
eligible, based on the firmware description (it effectively reverts to the
startup configuration).
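
The migration rules above can be modelled with a small user-space sketch. It is not the kernel implementation: ``struct sketch_tree``, the fixed port count and both function names are hypothetical, and interfaces are identified by name only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_NUM_USER_PORTS 4

/* Hypothetical model of a switch tree: one master name per user port. */
struct sketch_tree {
	const char *first_master;	/* first eligible master per firmware */
	const char *port_master[SKETCH_NUM_USER_PORTS];
};

/* A physical master gained a LAG upper: move its user ports to the LAG. */
static void sketch_master_joins_lag(struct sketch_tree *t,
				    const char *master, const char *lag)
{
	assert(t && master && lag);
	for (size_t i = 0; i < SKETCH_NUM_USER_PORTS; i++)
		if (strcmp(t->port_master[i], master) == 0)
			t->port_master[i] = lag;
}

/* The LAG went away: fall back to the first eligible physical master. */
static void sketch_lag_destroyed(struct sketch_tree *t, const char *lag)
{
	assert(t && lag);
	for (size_t i = 0; i < SKETCH_NUM_USER_PORTS; i++)
		if (strcmp(t->port_master[i], lag) == 0)
			t->port_master[i] = t->first_master;
}
```

In this model, user ports assigned to a master that is not part of the LAG keep their assignment across both events.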

In a setup with more than 2 physical CPU ports, it is therefore possible to mix
static user to CPU port assignment with LAG between DSA masters. It is not
possible to statically assign a user port towards a DSA master that has any
upper interfaces (this includes LAG devices - the master must always be the LAG
in this case).

Live changing of the DSA master (and thus CPU port) affinity of a user port is
permitted, in order to allow dynamic redistribution in response to traffic.

Physical DSA masters are allowed to join and leave a LAG interface used as a
DSA master at any time; however, DSA will reject a LAG interface as a valid
candidate for being a DSA master unless it has at least one physical DSA master
as a slave device.
+32 −6
@@ -303,6 +303,20 @@ These frames are then queued for transmission using the master network device
Ethernet switch will be able to process these incoming frames from the
management interface and deliver them to the physical switch port.

When using multiple CPU ports, it is possible to stack a LAG (bonding/team)
device between the DSA slave devices and the physical DSA masters. The LAG
device is thus also a DSA master, but the LAG slave devices continue to be DSA
masters as well (just with no user port assigned to them; this is needed for
recovery in case the LAG DSA master disappears). Thus, the data path of the LAG
DSA master is used asymmetrically. On RX, the ``ETH_P_XDSA`` handler, which
calls ``dsa_switch_rcv()``, is invoked early (on the physical DSA master;
LAG slave). Therefore, the RX data path of the LAG DSA master is not used.
On the other hand, TX takes place linearly: ``dsa_slave_xmit`` calls
``dsa_enqueue_skb``, which calls ``dev_queue_xmit`` towards the LAG DSA master.
The latter calls ``dev_queue_xmit`` towards one physical DSA master or the
other, and in both cases, the packet exits the system through a hardware path
towards the switch.

Graphical representation
------------------------

@@ -629,6 +643,24 @@ Switch configuration
  PHY cannot be found. In this case, probing of the DSA switch continues
  without that particular port.

- ``port_change_master``: method through which the affinity (association used
  for traffic termination purposes) between a user port and a CPU port can be
  changed. By default all user ports from a tree are assigned to the first
  available CPU port that makes sense for them (most of the time this means
  the user ports of a tree are all assigned to the same CPU port, except for H
  topologies as described in commit 2c0b03258b8b). The ``port`` argument
  represents the index of the user port, and the ``master`` argument represents
  the new DSA master ``net_device``. The CPU port associated with the new
  master can be retrieved by looking at ``struct dsa_port *cpu_dp =
  master->dsa_ptr``. Additionally, the master can also be a LAG device where
  all the slave devices are physical DSA masters. LAG DSA masters also have a
  valid ``master->dsa_ptr`` pointer, however this is not unique, but rather a
  duplicate of the first physical DSA master's (LAG slave) ``dsa_ptr``. In case
  of a LAG DSA master, a further call to ``port_lag_join`` will be emitted
  separately for the physical CPU ports associated with the physical DSA
  masters, requesting them to create a hardware LAG associated with the LAG
  interface.

PHY devices and link management
-------------------------------

@@ -1095,9 +1127,3 @@ capable hardware, but does not enforce a strict switch device driver model. On
the other hand, DSA enforces a fairly strict device driver model, and deals with
most of the switch-specific details. At some point we should envision a merger between these
two subsystems and get the best of both worlds.

Other hanging fruits
--------------------

- allowing more than one CPU/management interface:
  http://comments.gmane.org/gmane.linux.network/365657
+2 −2
@@ -983,7 +983,7 @@ static int bcm_sf2_sw_resume(struct dsa_switch *ds)
static void bcm_sf2_sw_get_wol(struct dsa_switch *ds, int port,
			       struct ethtool_wolinfo *wol)
{
-	struct net_device *p = dsa_to_port(ds, port)->cpu_dp->master;
+	struct net_device *p = dsa_port_to_master(dsa_to_port(ds, port));
	struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
	struct ethtool_wolinfo pwol = { };

@@ -1007,7 +1007,7 @@ static void bcm_sf2_sw_get_wol(struct dsa_switch *ds, int port,
static int bcm_sf2_sw_set_wol(struct dsa_switch *ds, int port,
			      struct ethtool_wolinfo *wol)
{
-	struct net_device *p = dsa_to_port(ds, port)->cpu_dp->master;
+	struct net_device *p = dsa_port_to_master(dsa_to_port(ds, port));
	struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
	s8 cpu_port = dsa_to_port(ds, port)->cpu_dp->index;
	struct ethtool_wolinfo pwol =  { };
+2 −2
@@ -1102,7 +1102,7 @@ static int bcm_sf2_cfp_rule_get_all(struct bcm_sf2_priv *priv,
int bcm_sf2_get_rxnfc(struct dsa_switch *ds, int port,
		      struct ethtool_rxnfc *nfc, u32 *rule_locs)
{
-	struct net_device *p = dsa_to_port(ds, port)->cpu_dp->master;
+	struct net_device *p = dsa_port_to_master(dsa_to_port(ds, port));
	struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
	int ret = 0;

@@ -1145,7 +1145,7 @@ int bcm_sf2_get_rxnfc(struct dsa_switch *ds, int port,
int bcm_sf2_set_rxnfc(struct dsa_switch *ds, int port,
		      struct ethtool_rxnfc *nfc)
{
-	struct net_device *p = dsa_to_port(ds, port)->cpu_dp->master;
+	struct net_device *p = dsa_port_to_master(dsa_to_port(ds, port));
	struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
	int ret = 0;

+2 −2
@@ -1092,7 +1092,7 @@ static int lan9303_port_enable(struct dsa_switch *ds, int port,
	if (!dsa_port_is_user(dp))
		return 0;

-	vlan_vid_add(dp->cpu_dp->master, htons(ETH_P_8021Q), port);
+	vlan_vid_add(dsa_port_to_master(dp), htons(ETH_P_8021Q), port);

	return lan9303_enable_processing_port(chip, port);
}
@@ -1105,7 +1105,7 @@ static void lan9303_port_disable(struct dsa_switch *ds, int port)
	if (!dsa_port_is_user(dp))
		return;

-	vlan_vid_del(dp->cpu_dp->master, htons(ETH_P_8021Q), port);
+	vlan_vid_del(dsa_port_to_master(dp), htons(ETH_P_8021Q), port);

	lan9303_disable_processing_port(chip, port);
	lan9303_phy_write(ds, chip->phy_addr_base + port, MII_BMCR, BMCR_PDOWN);