Commit 24a790da authored by Jakub Kicinski's avatar Jakub Kicinski
Browse files

Merge tag 'mlx5-updates-2021-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5 subfunction support

Parav Pandit says:

This patchset introduces support for mlx5 subfunction (SF).

A subfunction is a lightweight function that has a parent PCI function on
which it is deployed. mlx5 subfunction has its own function capabilities
and its own resources. This means a subfunction has its own dedicated
queues(txq, rxq, cq, eq). These queues are neither shared nor stolen from
the parent PCI function.

When subfunction is RDMA capable, it has its own QP1, GID table and rdma
resources neither shared nor stolen from the parent PCI function.

A subfunction has dedicated window in PCI BAR space that is not shared
with the other subfunctions or parent PCI function. This ensures that all
class devices of the subfunction accesses only assigned PCI BAR space.

A Subfunction supports eswitch representation through which it supports tc
offloads. User must configure eswitch to send/receive packets from/to
subfunction port.

Subfunctions share PCI level resources such as PCI MSI-X IRQs with
their other subfunctions and/or with its parent PCI function.

Subfunction support is discussed in detail in RFC [1] and [2].
RFC [1] and extension [2] describes requirements, design and proposed
plumbing using devlink, auxiliary bus and sysfs for systemd/udev
support. Functionality of this patchset is best explained using real
examples further below.

overview:
--------
A subfunction can be created and deleted by a user using devlink port
add/delete interface.

A subfunction can be configured using devlink port function attribute
before its activated.

When a subfunction is activated, it results in an auxiliary device on
the host PCI device where it is deployed. A driver binds to the
auxiliary device that further creates supported class devices.

example subfunction usage sequence:
-----------------------------------
Change device to switchdev mode:
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

Add a devlink port of subfunction flavour:
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88

Configure mac address of the port function:
$ devlink port function set ens2f0npf0sf88 hw_addr 00:00:00:00:88:88

Now activate the function:
$ devlink port function set ens2f0npf0sf88 state active

Now use the auxiliary device and class devices:
$ devlink dev show
pci/0000:06:00.0
auxiliary/mlx5_core.sf.4

$ ip link show
127: ens2f0np0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 24:8a:07:b3:d1:12 brd ff:ff:ff:ff:ff:ff
    altname enp6s0f0np0
129: p0sf88: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 00:00:00:00:88:88 brd ff:ff:ff:ff:ff:ff

$ rdma dev show
43: rdmap6s0f0: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d112 sys_image_guid 248a:0703:00b3:d112
44: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112

After use inactivate the function:
$ devlink port function set ens2f0npf0sf88 state inactive

Now delete the subfunction port:
$ devlink port del ens2f0npf0sf88

[1] https://lore.kernel.org/netdev/20200519092258.GF4655@nanopsycho/
[2] https://marc.info/?l=linux-netdev&m=158555928517777&w=2

=================

* tag 'mlx5-updates-2021-01-13' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5: Add devlink subfunction port documentation
  devlink: Extend devlink port documentation for subfunctions
  devlink: Add devlink port documentation
  net/mlx5: SF, Port function state change support
  net/mlx5: SF, Add port add delete functionality
  net/mlx5: E-switch, Add eswitch helpers for SF vport
  net/mlx5: E-switch, Prepare eswitch to handle SF vport
  net/mlx5: SF, Add auxiliary device driver
  net/mlx5: SF, Add auxiliary device support
  net/mlx5: Introduce vhca state event notifier
  devlink: Support get and set state of port function
  devlink: Support add and delete devlink port
  devlink: Introduce PCI SF port flavour and port attribute
  devlink: Prepare code to fill multiple port function attributes
====================

Link: https://lore.kernel.org/r/20210122193658.282884-1-saeed@kernel.org


Signed-off-by: default avatarJakub Kicinski <kuba@kernel.org>
parents 32e31b78 142d93d1
Loading
Loading
Loading
Loading
+2 −0
Original line number Diff line number Diff line
.. SPDX-License-Identifier: GPL-2.0-only

.. _auxiliary_bus:

=============
Auxiliary Bus
=============
+215 −0
Original line number Diff line number Diff line
@@ -12,6 +12,8 @@ Contents
- `Enabling the driver and kconfig options`_
- `Devlink info`_
- `Devlink parameters`_
- `mlx5 subfunction`_
- `mlx5 port function`_
- `Devlink health reporters`_
- `mlx5 tracepoints`_

@@ -97,6 +99,11 @@ Enabling the driver and kconfig options

|   Provides low-level InfiniBand/RDMA and `RoCE <https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support.

**CONFIG_MLX5_SF=(y/n)**

|   Build support for subfunction.
|   Subfunctons are more light weight than PCI SRIOV VFs. Choosing this option
|   will enable support for creating subfunction devices.

**External options** ( Choose if the corresponding mlx5 feature is required )

@@ -176,6 +183,214 @@ User command examples:
      values:
         cmode driverinit value true

mlx5 subfunction
================
mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.

A Subfunction has its own function capabilities and its own resources. This
means a subfunction has its own dedicated queues (txq, rxq, cq, eq). These
queues are neither shared nor stolen from the parent PCI function.

When a subfunction is RDMA capable, it has its own QP1, GID table and rdma
resources neither shared nor stolen from the parent PCI function.

A subfunction has a dedicated window in PCI BAR space that is not shared
with ther other subfunctions or the parent PCI function. This ensures that all
devices (netdev, rdma, vdpa etc.) of the subfunction accesses only assigned
PCI BAR space.

A Subfunction supports eswitch representation through which it supports tc
offloads. The user configures eswitch to send/receive packets from/to
the subfunction port.

Subfunctions share PCI level resources such as PCI MSI-X IRQs with
other subfunctions and/or with its parent PCI function.

Example mlx5 software, system and device view::

       _______
      | admin |
      | user  |----------
      |_______|         |
          |             |
      ____|____       __|______            _________________
     |         |     |         |          |                 |
     | devlink |     | tc tool |          |    user         |
     | tool    |     |_________|          | applications    |
     |_________|         |                |_________________|
           |             |                   |          |
           |             |                   |          |         Userspace
 +---------|-------------|-------------------|----------|--------------------+
           |             |           +----------+   +----------+   Kernel
           |             |           |  netdev  |   | rdma dev |
           |             |           +----------+   +----------+
   (devlink port add/del |              ^               ^
    port function set)   |              |               |
           |             |              +---------------|
      _____|___          |              |        _______|_______
     |         |         |              |       | mlx5 class    |
     | devlink |   +------------+       |       |   drivers     |
     | kernel  |   | rep netdev |       |       |(mlx5_core,ib) |
     |_________|   +------------+       |       |_______________|
           |             |              |               ^
   (devlink ops)         |              |          (probe/remove)
  _________|________     |              |           ____|________
 | subfunction      |    |     +---------------+   | subfunction |
 | management driver|-----     | subfunction   |---|  driver     |
 | (mlx5_core)      |          | auxiliary dev |   | (mlx5_core) |
 |__________________|          +---------------+   |_____________|
           |                                            ^
  (sf add/del, vhca events)                             |
           |                                      (device add/del)
      _____|____                                    ____|________
     |          |                                  | subfunction |
     |  PCI NIC |---- activate/deactive events---->| host driver |
     |__________|                                  | (mlx5_core) |
                                                   |_____________|

Subfunction is created using devlink port interface.

- Change device to switchdev mode::

    $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev

- Add a devlink port of subfunction flaovur::

    $ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
    pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
      function:
        hw_addr 00:00:00:00:00:00 state inactive opstate detached

- Show a devlink port of the subfunction::

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:00:00 state inactive opstate detached

- Delete a devlink port of subfunction after use::

    $ devlink port del pci/0000:06:00.0/32768

mlx5 function attributes
========================
The mlx5 driver provides a mechanism to setup PCI VF/SF function attributes in
a unified way for SmartNIC and non-SmartNIC.

This is supported only when the eswitch mode is set to switchdev. Port function
configuration of the PCI VF/SF is supported through devlink eswitch port.

Port function attributes should be set before PCI VF/SF is enumerated by the
driver.

MAC address setup
-----------------
mlx5 driver provides mechanism to setup the MAC address of the PCI VF/SF.

The configured MAC address of the PCI VF/SF will be used by netdevice and rdma
device created for the PCI VF/SF.

- Get the MAC address of the VF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the VF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:11:22:33:44:55

- Get the MAC address of the SF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the VF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcivf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:88:88

SF state setup
--------------
To use the SF, the user must active the SF using the SF function state
attribute.

- Get the state of the SF identified by its unique devlink port index::

   $ devlink port show ens2f0npf0sf88
   pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
     function:
       hw_addr 00:00:00:00:88:88 state inactive opstate detached

- Activate the function and verify its state is active::

   $ devlink port function set ens2f0npf0sf88 state active

   $ devlink port show ens2f0npf0sf88
   pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
     function:
       hw_addr 00:00:00:00:88:88 state active opstate detached

Upon function activation, the PF driver instance gets the event from the device
that a particular SF was activated. It's the cue to put the device on bus, probe
it and instantiate the devlink instance and class specific auxiliary devices
for it.

- Show the auxiliary device and port of the subfunction::

    $ devlink dev show
    devlink dev show auxiliary/mlx5_core.sf.4

    $ devlink port show auxiliary/mlx5_core.sf.4/1
    auxiliary/mlx5_core.sf.4/1: type eth netdev p0sf88 flavour virtual port 0 splittable false

    $ rdma link show mlx5_0/1
    link mlx5_0/1 state ACTIVE physical_state LINK_UP netdev p0sf88

    $ rdma dev show
    8: rocep6s0f1: node_type ca fw 16.29.0550 node_guid 248a:0703:00b3:d113 sys_image_guid 248a:0703:00b3:d112
    13: mlx5_0: node_type ca fw 16.29.0550 node_guid 0000:00ff:fe00:8888 sys_image_guid 248a:0703:00b3:d112

- Subfunction auxiliary device and class device hierarchy::

                 mlx5_core.sf.4
          (subfunction auxiliary device)
                       /\
                      /  \
                     /    \
                    /      \
                   /        \
      mlx5_core.eth.4     mlx5_core.rdma.4
     (sf eth aux dev)     (sf rdma aux dev)
         |                      |
         |                      |
      p0sf88                  mlx5_0
     (sf netdev)          (sf rdma device)

Additionally, the SF port also gets the event when the driver attaches to the
auxiliary device of the subfunction. This results in changing the operational
state of the function. This provides visiblity to the user to decide when is it
safe to delete the SF port for graceful termination of the subfunction.

- Show the SF port operational state::

    $ devlink port show ens2f0npf0sf88
    pci/0000:06:00.0/32768: type eth netdev ens2f0npf0sf88 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
      function:
        hw_addr 00:00:00:00:88:88 state active opstate attached

Devlink health reporters
========================

+199 −0
Original line number Diff line number Diff line
.. SPDX-License-Identifier: GPL-2.0

.. _devlink_port:

============
Devlink Port
============

``devlink-port`` is a port that exists on the device. It has a logically
separate ingress/egress point of the device. A devlink port can be any one
of many flavours. A devlink port flavour along with port attributes
describe what a port represents.

A device driver that intends to publish a devlink port sets the
devlink port attributes and registers the devlink port.

Devlink port flavours are described below.

.. list-table:: List of devlink port flavours
   :widths: 33 90

   * - Flavour
     - Description
   * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
     - Any kind of physical port. This can be an eswitch physical port or any
       other physical port on the device.
   * - ``DEVLINK_PORT_FLAVOUR_DSA``
     - This indicates a DSA interconnect port.
   * - ``DEVLINK_PORT_FLAVOUR_CPU``
     - This indicates a CPU port applicable only to DSA.
   * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
     - This indicates an eswitch port representing a port of PCI
       physical function (PF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
     - This indicates an eswitch port representing a port of PCI
       virtual function (VF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
     - This indicates an eswitch port representing a port of PCI
       subfunction (SF).
   * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
     - This indicates a virtual port for the PCI virtual function.

Devlink port can have a different type based on the link layer described below.

.. list-table:: List of devlink port types
   :widths: 23 90

   * - Type
     - Description
   * - ``DEVLINK_PORT_TYPE_ETH``
     - Driver should set this port type when a link layer of the port is
       Ethernet.
   * - ``DEVLINK_PORT_TYPE_IB``
     - Driver should set this port type when a link layer of the port is
       InfiniBand.
   * - ``DEVLINK_PORT_TYPE_AUTO``
     - This type is indicated by the user when driver should detect the port
       type automatically.

PCI controllers
---------------
In most cases a PCI device has only one controller. A controller consists of
potentially multiple physical, virtual functions and subfunctions. A function
consists of one or more ports. This port is represented by the devlink eswitch
port.

A PCI device connected to multiple CPUs or multiple PCI root complexes or a
SmartNIC, however, may have multiple controllers. For a device with multiple
controllers, each controller is distinguished by a unique controller number.
An eswitch is on the PCI device which supports ports of multiple controllers.

An example view of a system with two controllers::

                 ---------------------------------------------------------
                 |                                                       |
                 |           --------- ---------         ------- ------- |
    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
    | connect |  | -------                       -------                 |
    -----------  |     | controller_num=1 (no eswitch)                   |
                 ------|--------------------------------------------------
                 (internal wire)
                       |
                 ---------------------------------------------------------
                 | devlink eswitch ports and reps                        |
                 | ----------------------------------------------------- |
                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 |                                                       |
                 |                                                       |
    -----------  |           --------- ---------         ------- ------- |
    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
    -----------  | -------                       -------                 |
                 |                                                       |
                 |  local controller_num=0 (eswitch)                     |
                 ---------------------------------------------------------

In the above example, the external controller (identified by controller number = 1)
doesn't have the eswitch. Local controller (identified by controller number = 0)
has the eswitch. The Devlink instance on the local controller has eswitch
devlink ports for both the controllers.

Function configuration
======================

A user can configure the function attribute before enumerating the PCI
function. Usually it means, user should configure function attribute
before a bus specific device for the function is created. However, when
SRIOV is enabled, virtual function devices are created on the PCI bus.
Hence, function attribute should be configured before binding virtual
function device to the driver. For subfunctions, this means user should
configure port function attribute before activating the port function.

A user may set the hardware address of the function using
'devlink port function set hw_addr' command. For Ethernet port function
this means a MAC address.

Subfunction
============

Subfunction is a lightweight function that has a parent PCI function on which
it is deployed. Subfunction is created and deployed in unit of 1. Unlike
SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
A subfunction communicates with the hardware through the parent PCI function.

To use a subfunction, 3 steps setup sequence is followed.
(1) create - create a subfunction;
(2) configure - configure subfunction attributes;
(3) deploy - deploy the subfunction;

Subfunction management is done using devlink port user interface.
User performs setup on the subfunction management device.

(1) Create
----------
A subfunction is created using a devlink port interface. A user adds the
subfunction by adding a devlink port of subfunction flavour. The devlink
kernel code calls down to subfunction management driver (devlink ops) and asks
it to create a subfunction devlink port. Driver then instantiates the
subfunction port and any associated objects such as health reporters and
representor netdevice.

(2) Configure
-------------
A subfunction devlink port is created but it is not active yet. That means the
entities are created on devlink side, the e-switch port representor is created,
but the subfunction device itself it not created. A user might use e-switch port
representor to do settings, putting it into bridge, adding TC rules, etc. A user
might as well configure the hardware address (such as MAC address) of the
subfunction while subfunction is inactive.

(3) Deploy
----------
Once a subfunction is configured, user must activate it to use it. Upon
activation, subfunction management driver asks the subfunction management
device to instantiate the subfunction device on particular PCI function.
A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
At this point a matching subfunction driver binds to the subfunction's auxiliary device.

Terms and Definitions
=====================

.. list-table:: Terms and Definitions
   :widths: 22 90

   * - Term
     - Definitions
   * - ``PCI device``
     - A physical PCI device having one or more PCI bus consists of one or
       more PCI controllers.
   * - ``PCI controller``
     -  A controller consists of potentially multiple physical functions,
        virtual functions and subfunctions.
   * - ``Port function``
     -  An object to manage the function of a port.
   * - ``Subfunction``
     -  A lightweight function that has parent PCI function on which it is
        deployed.
   * - ``Subfunction device``
     -  A bus device of the subfunction, usually on a auxiliary bus.
   * - ``Subfunction driver``
     -  A device driver for the subfunction auxiliary device.
   * - ``Subfunction management device``
     -  A PCI physical function that supports subfunction management.
   * - ``Subfunction management driver``
     -  A device driver for PCI physical function that supports
        subfunction management using devlink port interface.
   * - ``Subfunction host driver``
     -  A device driver for PCI physical function that hosts subfunction
        devices. In most cases it is same as subfunction management driver. When
        subfunction is used on external controller, subfunction management and
        host drivers are different.
+1 −0
Original line number Diff line number Diff line
@@ -18,6 +18,7 @@ general.
   devlink-info
   devlink-flash
   devlink-params
   devlink-port
   devlink-region
   devlink-resource
   devlink-reload
+19 −0
Original line number Diff line number Diff line
@@ -203,3 +203,22 @@ config MLX5_SW_STEERING
	default y
	help
	Build support for software-managed steering in the NIC.

config MLX5_SF
	bool "Mellanox Technologies subfunction device support using auxiliary device"
	depends on MLX5_CORE && MLX5_CORE_EN
	default n
	help
	Build support for subfuction device in the NIC. A Mellanox subfunction
	device can support RDMA, netdevice and vdpa device.
	It is similar to a SRIOV VF but it doesn't require SRIOV support.

config MLX5_SF_MANAGER
	bool
	depends on MLX5_SF && MLX5_ESWITCH
	default y
	help
	Build support for subfuction port in the NIC. A Mellanox subfunction
	port is managed through devlink.  A subfunction supports RDMA, netdevice
	and vdpa device. It is similar to a SRIOV VF but it doesn't require
	SRIOV support.
Loading