Commit 24f627a3 authored by Jakub Kicinski

Merge branch 'implement-devlink-rate-api-and-extend-it'

Michal Wilczynski says:

====================
Implement devlink-rate API and extend it

This patch series implements devlink-rate for the ice driver. Unfortunately,
the current API isn't flexible enough for our use case, so it needs to be
extended. Some functions have been introduced to enable the driver to
export the current Tx scheduling configuration.

Pasting the justification for this series from the commit implementing
devlink-rate in the ice driver (which is part of this series):

There is a need to support modification of the Tx scheduler tree in the
ice driver. This will allow the user to control the Tx settings of each node
in the internal hierarchy of nodes. As a result, the user will be able to use
Hierarchical QoS implemented entirely in the hardware.

This patch implements the devlink-rate API. It also exports the initial
default hierarchy. This is mostly dictated by the fact that the tree
can't be removed entirely; all we can do is enable the user to modify
it. For example, the root node should never be removed, and nodes that
have children are also off-limits.

Example initial tree with 2 VFs:

[root@fedora ~]# devlink port function rate show
pci/0000:4b:00.0/node_27: type node parent node_26
pci/0000:4b:00.0/node_26: type node parent node_0
pci/0000:4b:00.0/node_34: type node parent node_33
pci/0000:4b:00.0/node_33: type node parent node_32
pci/0000:4b:00.0/node_32: type node parent node_16
pci/0000:4b:00.0/node_19: type node parent node_18
pci/0000:4b:00.0/node_18: type node parent node_17
pci/0000:4b:00.0/node_17: type node parent node_16
pci/0000:4b:00.0/node_21: type node parent node_20
pci/0000:4b:00.0/node_20: type node parent node_3
pci/0000:4b:00.0/node_14: type node parent node_5
pci/0000:4b:00.0/node_5: type node parent node_3
pci/0000:4b:00.0/node_13: type node parent node_4
pci/0000:4b:00.0/node_12: type node parent node_4
pci/0000:4b:00.0/node_11: type node parent node_4
pci/0000:4b:00.0/node_10: type node parent node_4
pci/0000:4b:00.0/node_9: type node parent node_4
pci/0000:4b:00.0/node_8: type node parent node_4
pci/0000:4b:00.0/node_7: type node parent node_4
pci/0000:4b:00.0/node_6: type node parent node_4
pci/0000:4b:00.0/node_4: type node parent node_3
pci/0000:4b:00.0/node_3: type node parent node_16
pci/0000:4b:00.0/node_16: type node parent node_15
pci/0000:4b:00.0/node_15: type node parent node_0
pci/0000:4b:00.0/node_2: type node parent node_1
pci/0000:4b:00.0/node_1: type node parent node_0
pci/0000:4b:00.0/node_0: type node
pci/0000:4b:00.0/1: type leaf parent node_27
pci/0000:4b:00.0/2: type leaf parent node_27

Let me visualize part of the tree:

                        +---------+
                        |  node_0 |
                        +---------+
                             |
                        +----v----+
                        | node_26 |
                        +----+----+
                             |
                        +----v----+
                        | node_27 |
                        +----+----+
                             |
                    |-----------------|
               +----v----+       +----v----+
               |   VF 1  |       |   VF 2  |
               +----+----+       +----+----+
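
The listing above encodes the tree as child-to-parent links, so the path
from a leaf to the root can be recovered by following the parent name
repeatedly. A minimal C sketch of that walk, assuming a hypothetical
`struct rate_obj` table that mirrors a few rows of the
`devlink port function rate show` output (it is an illustration only,
not a driver or devlink data structure):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, illustration-only mirror of one line of
 * "devlink port function rate show"; not a kernel structure. */
struct rate_obj {
	const char *name;   /* e.g. "node_27", or "1" for VF 1 */
	const char *parent; /* NULL for the root */
};

/* The visualized part of the example tree. */
static const struct rate_obj objs[] = {
	{ "node_0",  NULL },
	{ "node_26", "node_0" },
	{ "node_27", "node_26" },
	{ "1",       "node_27" },
	{ "2",       "node_27" },
};

/* Look up an object by name; returns NULL if it is not in the table. */
static const struct rate_obj *find_obj(const char *name)
{
	for (size_t i = 0; i < sizeof(objs) / sizeof(objs[0]); i++)
		if (strcmp(objs[i].name, name) == 0)
			return &objs[i];
	return NULL;
}

/* Count the hops from an object up to the root, or -1 if unknown. */
static int depth_of(const char *name)
{
	const struct rate_obj *o = find_obj(name);
	int depth = 0;

	if (!o)
		return -1;
	while (o->parent) {
		o = find_obj(o->parent);
		if (!o)
			return -1; /* dangling parent reference */
		depth++;
	}
	return depth;
}
```

With this table, `depth_of("1")` walks 1, node_27, node_26, node_0, which
matches the three edges drawn in the diagram.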

So at this point there are a couple of things that can be done.
For example, we could just assign parameters to the VFs.

[root@fedora ~]# devlink port function rate set pci/0000:4b:00.0/1 \
                 tx_max 5Gbps

This would cap the bandwidth of VF 1 at 5Gbps.

But let's say you would like to create a completely new branch.
This can be done like this:

[root@fedora ~]# devlink port function rate add \
                 pci/0000:4b:00.0/node_custom parent node_0
[root@fedora ~]# devlink port function rate add \
                 pci/0000:4b:00.0/node_custom_1 parent node_custom
[root@fedora ~]# devlink port function rate set \
                 pci/0000:4b:00.0/1 parent node_custom_1

This creates a completely new branch and reassigns VF 1 to it.

A number of parameters are supported for each node: tx_max, tx_share,
tx_priority and tx_weight.
====================

Link: https://lore.kernel.org/r/20221115104825.172668-1-michal.wilczynski@intel.com


Signed-off-by: Jakub Kicinski <kuba@kernel.org>
parents 4ab45e97 242dd643
+32 −1
@@ -191,13 +191,44 @@ API allows to configure following rate object's parameters:
``tx_max``
  Maximum TX rate value.

``tx_priority``
  Allows for usage of strict priority arbiter among siblings. This
  arbitration scheme attempts to schedule nodes based on their priority
  as long as the nodes remain within their bandwidth limit. The higher the
  priority the higher the probability that the node will get selected for
  scheduling.

``tx_weight``
  Allows for usage of Weighted Fair Queuing arbitration scheme among
  siblings. This arbitration scheme can be used simultaneously with the
  strict priority. As a node is configured with a higher rate it gets more
  BW relative to its siblings. Values are relative, like percentage
  points; they tell how much BW a node should take relative to
  its siblings.

``parent``
  Parent node name. Parent node rate limits are considered as additional limits
  to all node children limits. ``tx_max`` is an upper limit for children.
  ``tx_share`` is a total bandwidth distributed among children.

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.
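
As a rough illustration of "relative" weights, the sketch below computes
the idealized bandwidth split inside one WFQ subgroup. It assumes every
sibling always has traffic to send and ignores ``tx_max`` and ``tx_share``;
the function name and the Mbps unit are hypothetical and not part of the
devlink API:

```c
#include <stddef.h>

/* Idealized share of the parent's bandwidth (in Mbps) for sibling `idx`,
 * assuming all siblings are permanently backlogged. Returns 0 for an
 * out-of-range index or when all weights are zero. */
static unsigned int wfq_share_mbps(const unsigned int *weights, size_t n,
				   size_t idx, unsigned int parent_mbps)
{
	unsigned long long total = 0;

	for (size_t i = 0; i < n; i++)
		total += weights[i];
	if (total == 0 || idx >= n)
		return 0;
	/* share = parent * weight[idx] / sum(weights) */
	return (unsigned int)((unsigned long long)parent_mbps *
			      weights[idx] / total);
}
```

For example, two siblings with weights 1 and 3 under a 1000 Mbps parent
would ideally get 250 Mbps and 750 Mbps; weights 2 and 6 give the same
split, since only the ratio matters.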

Arbitration flow from the high level:

#. Choose a node, or group of nodes, with the highest priority that stays
   within the BW limit and is not blocked. Use ``tx_priority`` as a
   parameter for this arbitration.
#. If a group of nodes has the same priority, perform WFQ arbitration on
   that subgroup. Use ``tx_weight`` as a parameter for this arbitration.
#. Select the winner node, and continue the arbitration flow among its
   children, until a leaf node is reached and the winner is established.
#. If all the nodes from the highest priority sub-group are satisfied, or
   have overused their assigned BW, move to the lower priority nodes.
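
The flow above can be sketched for a single sibling group as follows. This
is a simplified software model, not how any hardware scheduler is
implemented: the `struct sib` fields, the `eligible` flag (standing in for
"within the BW limit and not blocked") and the use of a served/weight ratio
for the WFQ step are all assumptions made for illustration:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one sibling in a scheduling group. */
struct sib {
	unsigned int prio;    /* tx_priority: higher value wins */
	unsigned int weight;  /* tx_weight: relative share, must be > 0 */
	unsigned long served; /* units already sent, used by the WFQ step */
	bool eligible;        /* within its BW limit and not blocked */
};

/* Returns the index of the winning sibling, or -1 if none is eligible. */
static int pick_sibling(const struct sib *s, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		/* Step 4: satisfied or blocked nodes are skipped, so lower
		 * priorities get their turn. */
		if (!s[i].eligible)
			continue;
		/* Step 1: strict priority among eligible siblings. */
		if (best < 0 || s[i].prio > s[best].prio) {
			best = (int)i;
			continue;
		}
		/* Step 2: same priority, so prefer the node with the lower
		 * served/weight ratio (compared by cross-multiplying). */
		if (s[i].prio == s[best].prio &&
		    (unsigned long long)s[i].served * s[best].weight <
		    (unsigned long long)s[best].served * s[i].weight)
			best = (int)i;
	}
	return best;
}
```

Step 3 would then repeat the same selection among the winner's children
until a leaf is reached.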

Driver implementations are allowed to support both or either rate object types
and setting methods of their parameters.
and setting methods of their parameters. Additionally, a driver implementation
may export nodes/leaves and their child-parent relationships.

Terms and Definitions
=====================
+115 −0
@@ -254,3 +254,118 @@ Users can request an immediate capture of a snapshot via the
    0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

    $ devlink region delete pci/0000:01:00.0/device-caps snapshot 1

Devlink Rate
============

The ``ice`` driver implements the devlink-rate API. It allows for offload of
Hierarchical QoS to the hardware. It enables the user to group Virtual
Functions in a tree structure and assign supported parameters: tx_share,
tx_max, tx_priority and tx_weight to each node in a tree. So effectively
the user gains the ability to control how much bandwidth is allocated for each
VF group. This is later enforced by the HW.

It is assumed that this feature is mutually exclusive with DCB performed
in FW and ADQ, or any driver feature that would trigger changes in QoS,
for example creation of a new traffic class. The driver will prevent DCB
or ADQ configuration if the user has started making any changes to the nodes
using the devlink-rate API. To configure those features, a driver reload is
necessary. Correspondingly, if ADQ or DCB gets configured, the driver won't
export the hierarchy at all, or will remove the untouched hierarchy if those
features are enabled after the hierarchy is exported, but before any
changes are made.

This feature is also dependent on switchdev being enabled in the system.
It's required because devlink-rate requires devlink-port objects to be
present, and those objects are only created in switchdev mode.

If the driver is set to switchdev mode, it will export the internal
hierarchy the moment VFs are created. The root of the tree is always
represented by node_0. This node can't be deleted by the user. Leaf
nodes and nodes with children also can't be deleted.

.. list-table:: Attributes supported
    :widths: 15 85

    * - Name
      - Description
    * - ``tx_max``
      - maximum bandwidth to be consumed by the tree Node. Rate Limit is
        an absolute number specifying the maximum number of bytes a Node may
        consume during the course of one second. Rate limit guarantees
        that a link will not oversaturate the receiver on the remote end
        and also enforces an SLA between the subscriber and network
        provider.
    * - ``tx_share``
      - minimum bandwidth allocated to a tree node when it is not blocked.
        It specifies an absolute BW. While tx_max defines the maximum
        bandwidth the node may consume, the tx_share marks committed BW
        for the Node.
    * - ``tx_priority``
      - allows for usage of strict priority arbiter among siblings. This
        arbitration scheme attempts to schedule nodes based on their
        priority as long as the nodes remain within their bandwidth limit.
        Range 0-7. Nodes with priority 7 have the highest priority and are
        selected first, while nodes with priority 0 have the lowest
        priority. Nodes that have the same priority are treated equally.
    * - ``tx_weight``
      - allows for usage of Weighted Fair Queuing arbitration scheme among
        siblings. This arbitration scheme can be used simultaneously with
        the strict priority. Range 1-200. Only relative values matter for
        arbitration.

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.
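
The ranges in the table above can be expressed as simple checks. The
sketch below is an illustration only: the `DEMO_` macros and helper names
are hypothetical and are not taken from the ``ice`` driver, which may
validate these values differently:

```c
#include <stdbool.h>

/* Ranges as documented in the attribute table; names are hypothetical. */
#define DEMO_TX_PRIORITY_MAX	7
#define DEMO_TX_WEIGHT_MIN	1
#define DEMO_TX_WEIGHT_MAX	200

/* tx_priority: 0 (lowest) to 7 (highest). */
static bool demo_tx_priority_valid(unsigned int prio)
{
	return prio <= DEMO_TX_PRIORITY_MAX;
}

/* tx_weight: 1 to 200, only meaningful relative to siblings. */
static bool demo_tx_weight_valid(unsigned int weight)
{
	return weight >= DEMO_TX_WEIGHT_MIN && weight <= DEMO_TX_WEIGHT_MAX;
}
```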

.. code:: shell

    # enable switchdev
    $ devlink dev eswitch set pci/0000:4b:00.0 mode switchdev

    # at this point driver should export internal hierarchy
    $ echo 2 > /sys/class/net/ens785np0/device/sriov_numvfs

    $ devlink port function rate show
    pci/0000:4b:00.0/node_25: type node parent node_24
    pci/0000:4b:00.0/node_24: type node parent node_0
    pci/0000:4b:00.0/node_32: type node parent node_31
    pci/0000:4b:00.0/node_31: type node parent node_30
    pci/0000:4b:00.0/node_30: type node parent node_16
    pci/0000:4b:00.0/node_19: type node parent node_18
    pci/0000:4b:00.0/node_18: type node parent node_17
    pci/0000:4b:00.0/node_17: type node parent node_16
    pci/0000:4b:00.0/node_14: type node parent node_5
    pci/0000:4b:00.0/node_5: type node parent node_3
    pci/0000:4b:00.0/node_13: type node parent node_4
    pci/0000:4b:00.0/node_12: type node parent node_4
    pci/0000:4b:00.0/node_11: type node parent node_4
    pci/0000:4b:00.0/node_10: type node parent node_4
    pci/0000:4b:00.0/node_9: type node parent node_4
    pci/0000:4b:00.0/node_8: type node parent node_4
    pci/0000:4b:00.0/node_7: type node parent node_4
    pci/0000:4b:00.0/node_6: type node parent node_4
    pci/0000:4b:00.0/node_4: type node parent node_3
    pci/0000:4b:00.0/node_3: type node parent node_16
    pci/0000:4b:00.0/node_16: type node parent node_15
    pci/0000:4b:00.0/node_15: type node parent node_0
    pci/0000:4b:00.0/node_2: type node parent node_1
    pci/0000:4b:00.0/node_1: type node parent node_0
    pci/0000:4b:00.0/node_0: type node
    pci/0000:4b:00.0/1: type leaf parent node_25
    pci/0000:4b:00.0/2: type leaf parent node_25

    # let's create some custom node
    $ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0

    # second custom node
    $ devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent node_custom

    # reassign second VF to newly created branch
    $ devlink port function rate set pci/0000:4b:00.0/2 parent node_custom_1

    # assign tx_weight to the VF
    $ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 5

    # assign tx_share to the VF
    $ devlink port function rate set pci/0000:4b:00.0/2 tx_share 500Mbps
+2 −2
@@ -848,9 +848,9 @@ struct ice_aqc_txsched_elem {
	u8 generic;
#define ICE_AQC_ELEM_GENERIC_MODE_M		0x1
#define ICE_AQC_ELEM_GENERIC_PRIO_S		0x1
#define ICE_AQC_ELEM_GENERIC_PRIO_M	(0x7 << ICE_AQC_ELEM_GENERIC_PRIO_S)
#define ICE_AQC_ELEM_GENERIC_PRIO_M	        GENMASK(3, 1)
#define ICE_AQC_ELEM_GENERIC_SP_S		0x4
#define ICE_AQC_ELEM_GENERIC_SP_M	(0x1 << ICE_AQC_ELEM_GENERIC_SP_S)
#define ICE_AQC_ELEM_GENERIC_SP_M	        GENMASK(4, 4)
#define ICE_AQC_ELEM_GENERIC_ADJUST_VAL_S	0x5
#define ICE_AQC_ELEM_GENERIC_ADJUST_VAL_M	\
	(0x3 << ICE_AQC_ELEM_GENERIC_ADJUST_VAL_S)
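
The two mask definitions in this hunk are meant to be value-preserving. A
userspace sketch that checks this, using a simplified `DEMO_GENMASK`
stand-in (an assumption for illustration; the kernel's `GENMASK` is
defined differently but yields the same bit pattern for these arguments)
and a hypothetical helper that extracts the priority field:

```c
#include <stdint.h>

/* Simplified stand-in for GENMASK(h, l): bits l..h set, valid for
 * small fields where h - l + 1 < 32. */
#define DEMO_GENMASK(h, l)	(((1u << ((h) - (l) + 1)) - 1u) << (l))

#define DEMO_PRIO_S		0x1
#define DEMO_PRIO_M_OLD		(0x7u << DEMO_PRIO_S)
#define DEMO_PRIO_M_NEW		DEMO_GENMASK(3, 1)

#define DEMO_SP_S		0x4
#define DEMO_SP_M_OLD		(0x1u << DEMO_SP_S)
#define DEMO_SP_M_NEW		DEMO_GENMASK(4, 4)

/* Extract the 3-bit priority field from the `generic` byte. */
static unsigned int demo_get_prio(uint8_t generic)
{
	return (generic & DEMO_PRIO_M_NEW) >> DEMO_PRIO_S;
}
```

Both forms evaluate to 0x0E for the priority mask and 0x10 for the strict
priority mask, so a `generic` byte of 0x0A carries priority 5.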
+5 −2
@@ -1105,6 +1105,9 @@ int ice_init_hw(struct ice_hw *hw)

	hw->evb_veb = true;

	/* init xarray for identifying scheduling nodes uniquely */
	xa_init_flags(&hw->port_info->sched_node_ids, XA_FLAGS_ALLOC);

	/* Query the allocated resources for Tx scheduler */
	status = ice_sched_query_res_alloc(hw);
	if (status) {
@@ -4600,7 +4603,7 @@ ice_ena_vsi_txq(struct ice_port_info *pi, u16 vsi_handle, u8 tc, u16 q_handle,
	q_ctx->q_teid = le32_to_cpu(node.node_teid);

	/* add a leaf node into scheduler tree queue layer */
	status = ice_sched_add_node(pi, hw->num_tx_sched_layers - 1, &node);
	status = ice_sched_add_node(pi, hw->num_tx_sched_layers - 1, &node, NULL);
	if (!status)
		status = ice_sched_replay_q_bw(pi, q_ctx);

@@ -4835,7 +4838,7 @@ ice_ena_vsi_rdma_qset(struct ice_port_info *pi, u16 vsi_handle, u8 tc,
	for (i = 0; i < num_qsets; i++) {
		node.node_teid = buf->rdma_qsets[i].qset_teid;
		ret = ice_sched_add_node(pi, hw->num_tx_sched_layers - 1,
					 &node);
					 &node, NULL);
		if (ret)
			break;
		qset_teid[i] = le32_to_cpu(node.node_teid);
+1 −1
@@ -1580,7 +1580,7 @@ ice_update_port_tc_tree_cfg(struct ice_port_info *pi,
		/* new TC */
		status = ice_sched_query_elem(pi->hw, teid2, &elem);
		if (!status)
			status = ice_sched_add_node(pi, 1, &elem);
			status = ice_sched_add_node(pi, 1, &elem, NULL);
		if (status)
			break;
		/* update the TC number */