  1. Mar 03, 2022
  2. Mar 02, 2022
    • batman-adv: Demote batadv-on-batadv skip error message · 6ee3c393
      Sven Eckelmann authored
      
      
      The error message "Cannot find parent device" was shown for users of
      macvtap (on batadv devices) whenever the macvtap was moved to a different
      netns. This happens because macvtap doesn't provide an implementation for
      rtnl_link_ops->get_link_net.
      
      The situation for which this message is printed is actually not an error
      but just a warning that the optional sanity check was skipped. So demote
      the message from error to warning and adjust the text to better explain
      what happened.
      
      Reported-by: Leonardo Mörlein <freifunk@irrelefant.net>
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
    • batman-adv: Migrate to linux/container_of.h · eb7da4f1
      Sven Eckelmann authored
      The commit d2a8ebbf ("kernel.h: split out container_of() and
      typeof_member() macros") introduced a new header for the container_of
      related macros from (previously) linux/kernel.h.
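
As a rough illustration of what the container_of family does, here is a userspace sketch. It is not the kernel macro: `demo_container_of`, `struct demo_iface` and `struct demo_list_node` are made-up names, and the kernel version adds type checking that is left out here. In kernel code one would include linux/container_of.h and use container_of() itself.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(); the real macro
 * also verifies the pointer type, which this sketch omits. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical structures, for illustration only. */
struct demo_list_node {
	struct demo_list_node *next;
};

struct demo_iface {
	int ifindex;
	struct demo_list_node node; /* embedded member */
};

/* Recover the enclosing demo_iface from a pointer to its embedded node. */
static struct demo_iface *demo_iface_from_node(struct demo_list_node *n)
{
	return demo_container_of(n, struct demo_iface, node);
}
```

Given a `struct demo_iface iface`, calling `demo_iface_from_node(&iface.node)` yields `&iface` again.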
      
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
    • Merge branch 'if_ether-h-add-industrial-fieldbus-ethertypes' · 96946d89
      Jakub Kicinski authored
      
      
      Daniel Braunwarth says:
      
      ====================
      if_ether.h: add industrial fieldbus Ethertypes
      
      This set of patches adds the Ethertypes for PROFINET and EtherCAT.
      
      The defines should be used by iproute2 to extend the list of available link
      layer protocols.
      ====================
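
A small userspace sketch of how such defines might be used to recognize the two protocols. The values 0x8892 (PROFINET) and 0x88A4 (EtherCAT) are the registered EtherTypes; the `DEMO_` names and both helper functions are placeholders invented for this example, and VLAN-tagged frames are deliberately not handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Registered EtherType values; the DEMO_ names are placeholders. */
#define DEMO_ETH_P_PROFINET 0x8892
#define DEMO_ETH_P_ETHERCAT 0x88A4

/* Read the EtherType of an untagged Ethernet II frame (bytes 12-13,
 * big endian). Returns 0 if the frame is shorter than the header. */
static uint16_t demo_ethertype(const uint8_t *frame, size_t len)
{
	if (len < 14)
		return 0;
	return (uint16_t)((frame[12] << 8) | frame[13]);
}

/* Map an EtherType to a fieldbus protocol name. */
static const char *demo_fieldbus_name(uint16_t ethertype)
{
	switch (ethertype) {
	case DEMO_ETH_P_PROFINET:
		return "PROFINET";
	case DEMO_ETH_P_ETHERCAT:
		return "EtherCAT";
	default:
		return "other";
	}
}
```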
      
      Link: https://lore.kernel.org/r/20220228133029.100913-1-daniel@braunwarth.dev
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • if_ether.h: add EtherCAT Ethertype · cd73cda7
      Daniel Braunwarth authored
      
      
      Add the Ethertype for EtherCAT protocol.
      
      Signed-off-by: Daniel Braunwarth <daniel@braunwarth.dev>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • if_ether.h: add PROFINET Ethertype · dd0ca255
      Daniel Braunwarth authored
      
      
      Add the Ethertype for PROFINET protocol.
      
      Signed-off-by: Daniel Braunwarth <daniel@braunwarth.dev>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • macvtap: advertise link netns via netlink · a0219215
      Sven Eckelmann authored
      
      
      Assign rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
      added to rtnetlink messages. This fixes iproute2 which otherwise resolved
      the link interface to an interface in the wrong namespace.
      
      Test commands:
      
        ip netns add nst
        ip link add dummy0 type dummy
        ip link add link dummy0 name macvtap0 type macvtap
        ip link set macvtap0 netns nst
        ip -netns nst link show macvtap0
      
      Before:
      
        10: macvtap0@gre0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 500
            link/ether 5e:8f:ae:1d:60:50 brd ff:ff:ff:ff:ff:ff
      
      After:
      
        10: macvtap0@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 500
            link/ether 5e:8f:ae:1d:60:50 brd ff:ff:ff:ff:ff:ff link-netnsid 0
      
      Reported-by: Leonardo Mörlein <freifunk@irrelefant.net>
      Signed-off-by: Sven Eckelmann <sven@narfation.org>
      Link: https://lore.kernel.org/r/20220228003240.1337426-1-sven@narfation.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • nfp: avoid newline at end of message in NL_SET_ERR_MSG_MOD · 323d51ca
      Wan Jiabing authored
      
      
      Fix the following coccicheck warning:
      ./drivers/net/ethernet/netronome/nfp/flower/qos_conf.c:750:7-55: WARNING
      avoid newline at end of message in NL_SET_ERR_MSG_MOD
      
      Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Link: https://lore.kernel.org/r/20220301112356.1820985-1-wanjiabing@vivo.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tun: support NAPI for packets received from batched XDP buffs · fb3f9037
      Harold Huang authored
      
      
      tun already supports NAPI, so NAPI can also be used in the batched XDP
      buff path to accelerate packet processing. Once NAPI is used, GRO is
      supported as well. iperf shows that single-stream throughput improves
      from 4.5 Gbps to 9.2 Gbps. Additionally, 9.2 Gbps nearly reaches the
      line speed of the phy NIC, and about 15% of the CPU core running the
      vhost thread remains idle.
      
      Test topology:
      [iperf server]<--->tap<--->dpdk testpmd<--->phy nic<--->[iperf client]
      
      Iperf stream:
      iperf3 -c 10.0.0.2  -i 1 -t 10
      
      Before:
      ...
      [  5]   5.00-6.00   sec   558 MBytes  4.68 Gbits/sec    0   1.50 MBytes
      [  5]   6.00-7.00   sec   556 MBytes  4.67 Gbits/sec    1   1.35 MBytes
      [  5]   7.00-8.00   sec   556 MBytes  4.67 Gbits/sec    2   1.18 MBytes
      [  5]   8.00-9.00   sec   559 MBytes  4.69 Gbits/sec    0   1.48 MBytes
      [  5]   9.00-10.00  sec   556 MBytes  4.67 Gbits/sec    1   1.33 MBytes
      - - - - - - - - - - - - - - - - - - - - - - - - -
      [ ID] Interval           Transfer     Bitrate         Retr
      [  5]   0.00-10.00  sec  5.39 GBytes  4.63 Gbits/sec   72          sender
      [  5]   0.00-10.04  sec  5.39 GBytes  4.61 Gbits/sec               receiver
      
      After:
      ...
      [  5]   5.00-6.00   sec  1.07 GBytes  9.19 Gbits/sec    0   1.55 MBytes
      [  5]   6.00-7.00   sec  1.08 GBytes  9.30 Gbits/sec    0   1.63 MBytes
      [  5]   7.00-8.00   sec  1.08 GBytes  9.25 Gbits/sec    0   1.72 MBytes
      [  5]   8.00-9.00   sec  1.08 GBytes  9.25 Gbits/sec   77   1.31 MBytes
      [  5]   9.00-10.00  sec  1.08 GBytes  9.24 Gbits/sec    0   1.48 MBytes
      - - - - - - - - - - - - - - - - - - - - - - - - -
      [ ID] Interval           Transfer     Bitrate         Retr
      [  5]   0.00-10.00  sec  10.8 GBytes  9.28 Gbits/sec  166          sender
      [  5]   0.00-10.04  sec  10.8 GBytes  9.24 Gbits/sec               receiver
      
      Reported-at: https://lore.kernel.org/all/CACGkMEvTLG0Ayg+TtbN4q4pPW-ycgCCs3sC3-TF8cuRTf7Pp1A@mail.gmail.com
      Signed-off-by: Harold Huang <baymaxhuang@gmail.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/r/20220228033805.1579435-1-baymaxhuang@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'sfc-optimize-rxqs-count-and-affinities' · 422ce836
      Jakub Kicinski authored
      
      
      Íñigo Huguet says:
      
      ====================
      sfc: optimize RXQs count and affinities
      
      In sfc driver one RX queue per physical core was allocated by default.
      Later on, IRQ affinities were set spreading the IRQs in all NUMA local
      CPUs.
      
      However, that default results in a suboptimal configuration on many
      modern systems. Specifically, in systems with hyper threading and 2 NUMA
      nodes, affinities are set in a way that IRQs are handled by all logical
      cores of the same NUMA node. Handling IRQs from both hyper threading
      siblings has no benefit, and setting affinities to one queue per
      physical core is not a very good idea either, because there is a
      performance penalty for moving data across nodes (I was able to check it
      with some XDP tests using pktgen).

      These patches reduce the default number of channels to one per physical
      core in the local NUMA node. Then, they set IRQ affinities to CPUs in
      the local NUMA node only. This saves hardware resources, since
      channels are a limited resource. It also leaves more room for XDP_TX
      channels without hitting the driver's limit of 32 channels per interface.

      Running performance tests using iperf with a SFC9140 device showed no
      performance penalty for reducing the number of channels.

      RX XDP tests showed that performance can drop to less than half if
      the IRQ is handled by a CPU in a different NUMA node, which doesn't
      happen with the new defaults from these patches.
      ====================
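
The new default channel count described in this cover letter can be sketched in userspace roughly as follows. This is not the sfc driver code: `struct demo_cpu` and all `demo_` functions are hypothetical, and the sketch assumes that hyper-threading siblings share a `core_id` within a node.

```c
#include <stddef.h>

/* Hypothetical per-CPU topology record, for illustration only. */
struct demo_cpu {
	int node;    /* NUMA node the CPU belongs to */
	int core_id; /* physical core; hyper-threading siblings share it */
	int online;  /* non-zero if the CPU is online */
};

/* Does this CPU count for the given node? node < 0 means "any node". */
static int demo_cpu_matches(const struct demo_cpu *cpu, int node)
{
	return cpu->online && (node < 0 || cpu->node == node);
}

/* Count distinct physical cores among the matching CPUs. */
static size_t demo_count_cores(const struct demo_cpu *cpus, size_t n, int node)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		int seen = 0;

		if (!demo_cpu_matches(&cpus[i], node))
			continue;
		/* Skip a CPU whose physical core was already counted. */
		for (size_t j = 0; j < i && !seen; j++)
			seen = demo_cpu_matches(&cpus[j], node) &&
			       cpus[j].node == cpus[i].node &&
			       cpus[j].core_id == cpus[i].core_id;
		if (!seen)
			count++;
	}
	return count;
}

/* One channel per physical core in the local node; if the local node
 * has no online CPU, fall back to all online cores. */
static size_t demo_default_channels(const struct demo_cpu *cpus, size_t n,
				    int local_node)
{
	size_t count = demo_count_cores(cpus, n, local_node);

	if (count == 0)
		count = demo_count_cores(cpus, n, -1);
	return count;
}
```

For a machine with 2 nodes, 8 physical cores per node and 2 threads per core, this yields 8 channels for a NIC attached to either node.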
      
      Link: https://lore.kernel.org/r/20220228132254.25787-1-ihuguet@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • sfc: set affinity hints in local NUMA node only · 09a99ab1
      Íñigo Huguet authored
      
      
      Affinity hints were being set to CPUs in the local NUMA node first, and
      then to other CPUs. This caused 2 unintended issues:
      1. Channels created to be assigned each to a different physical core
         were assigned to hyperthreading siblings because they were in the
         same NUMA node.
         Since the previous patch, this no longer happens with the default
         rss_cpus modparam because fewer channels are created.
      2. XDP channels could be assigned to CPUs in different NUMA nodes,
         decreasing performance too much (to less than half in some of my
         tests).

      This patch sets the affinity hints spreading the channels only across
      the local NUMA node's CPUs. A fallback for the case that no CPU in the
      local NUMA node is online has been added too.
      
      Example of CPUs being assigned in a non optimal way before this and the
      previous patch (note: in this system, xdp-8 to xdp-15 are created
      because num_possible_cpus == 64, but num_present_cpus == 32 so they're
      never used):
      
      $ lscpu | grep -i numa
      NUMA node(s):                    2
      NUMA node0 CPU(s):               0-7,16-23
      NUMA node1 CPU(s):               8-15,24-31
      
      $ grep -H . /proc/irq/*/0000:07:00.0*/../smp_affinity_list
      /proc/irq/141/0000:07:00.0-0/../smp_affinity_list:0
      /proc/irq/142/0000:07:00.0-1/../smp_affinity_list:1
      /proc/irq/143/0000:07:00.0-2/../smp_affinity_list:2
      /proc/irq/144/0000:07:00.0-3/../smp_affinity_list:3
      /proc/irq/145/0000:07:00.0-4/../smp_affinity_list:4
      /proc/irq/146/0000:07:00.0-5/../smp_affinity_list:5
      /proc/irq/147/0000:07:00.0-6/../smp_affinity_list:6
      /proc/irq/148/0000:07:00.0-7/../smp_affinity_list:7
      /proc/irq/149/0000:07:00.0-8/../smp_affinity_list:16
      /proc/irq/150/0000:07:00.0-9/../smp_affinity_list:17
      /proc/irq/151/0000:07:00.0-10/../smp_affinity_list:18
      /proc/irq/152/0000:07:00.0-11/../smp_affinity_list:19
      /proc/irq/153/0000:07:00.0-12/../smp_affinity_list:20
      /proc/irq/154/0000:07:00.0-13/../smp_affinity_list:21
      /proc/irq/155/0000:07:00.0-14/../smp_affinity_list:22
      /proc/irq/156/0000:07:00.0-15/../smp_affinity_list:23
      /proc/irq/157/0000:07:00.0-xdp-0/../smp_affinity_list:8
      /proc/irq/158/0000:07:00.0-xdp-1/../smp_affinity_list:9
      /proc/irq/159/0000:07:00.0-xdp-2/../smp_affinity_list:10
      /proc/irq/160/0000:07:00.0-xdp-3/../smp_affinity_list:11
      /proc/irq/161/0000:07:00.0-xdp-4/../smp_affinity_list:12
      /proc/irq/162/0000:07:00.0-xdp-5/../smp_affinity_list:13
      /proc/irq/163/0000:07:00.0-xdp-6/../smp_affinity_list:14
      /proc/irq/164/0000:07:00.0-xdp-7/../smp_affinity_list:15
      /proc/irq/165/0000:07:00.0-xdp-8/../smp_affinity_list:24
      /proc/irq/166/0000:07:00.0-xdp-9/../smp_affinity_list:25
      /proc/irq/167/0000:07:00.0-xdp-10/../smp_affinity_list:26
      /proc/irq/168/0000:07:00.0-xdp-11/../smp_affinity_list:27
      /proc/irq/169/0000:07:00.0-xdp-12/../smp_affinity_list:28
      /proc/irq/170/0000:07:00.0-xdp-13/../smp_affinity_list:29
      /proc/irq/171/0000:07:00.0-xdp-14/../smp_affinity_list:30
      /proc/irq/172/0000:07:00.0-xdp-15/../smp_affinity_list:31
      
      CPU assignments after this and the previous patch, with normal channels
      created only one per core in the local NUMA node and affinities set
      only to the local NUMA node:
      
      $ grep -H . /proc/irq/*/0000:07:00.0*/../smp_affinity_list
      /proc/irq/116/0000:07:00.0-0/../smp_affinity_list:0
      /proc/irq/117/0000:07:00.0-1/../smp_affinity_list:1
      /proc/irq/118/0000:07:00.0-2/../smp_affinity_list:2
      /proc/irq/119/0000:07:00.0-3/../smp_affinity_list:3
      /proc/irq/120/0000:07:00.0-4/../smp_affinity_list:4
      /proc/irq/121/0000:07:00.0-5/../smp_affinity_list:5
      /proc/irq/122/0000:07:00.0-6/../smp_affinity_list:6
      /proc/irq/123/0000:07:00.0-7/../smp_affinity_list:7
      /proc/irq/124/0000:07:00.0-xdp-0/../smp_affinity_list:16
      /proc/irq/125/0000:07:00.0-xdp-1/../smp_affinity_list:17
      /proc/irq/126/0000:07:00.0-xdp-2/../smp_affinity_list:18
      /proc/irq/127/0000:07:00.0-xdp-3/../smp_affinity_list:19
      /proc/irq/128/0000:07:00.0-xdp-4/../smp_affinity_list:20
      /proc/irq/129/0000:07:00.0-xdp-5/../smp_affinity_list:21
      /proc/irq/130/0000:07:00.0-xdp-6/../smp_affinity_list:22
      /proc/irq/131/0000:07:00.0-xdp-7/../smp_affinity_list:23
      /proc/irq/132/0000:07:00.0-xdp-8/../smp_affinity_list:0
      /proc/irq/133/0000:07:00.0-xdp-9/../smp_affinity_list:1
      /proc/irq/134/0000:07:00.0-xdp-10/../smp_affinity_list:2
      /proc/irq/135/0000:07:00.0-xdp-11/../smp_affinity_list:3
      /proc/irq/136/0000:07:00.0-xdp-12/../smp_affinity_list:4
      /proc/irq/137/0000:07:00.0-xdp-13/../smp_affinity_list:5
      /proc/irq/138/0000:07:00.0-xdp-14/../smp_affinity_list:6
      /proc/irq/139/0000:07:00.0-xdp-15/../smp_affinity_list:7
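
The "after" listing is consistent with a simple round robin over the local node's CPUs. A minimal sketch of that idea, assuming the caller already has the list of local CPUs and the list of all online CPUs; `demo_affinity_cpu` is a hypothetical helper, not the driver's function:

```c
#include <stddef.h>

/* Pick the CPU for a channel by round robin over the local NUMA node's
 * CPUs. If no local CPU is online, fall back to all online CPUs.
 * Returns -1 when both lists are empty. */
static int demo_affinity_cpu(size_t channel,
			     const int *local, size_t n_local,
			     const int *online, size_t n_online)
{
	if (n_local > 0)
		return local[channel % n_local];
	if (n_online > 0)
		return online[channel % n_online];
	return -1;
}
```

With local CPUs 0-7,16-23 and 24 channels, channel 8 (xdp-0) lands on CPU 16 and channel 16 (xdp-8) wraps around to CPU 0, as in the listing above.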
      
      Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
      Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • sfc: default config to 1 channel/core in local NUMA node only · c265b569
      Íñigo Huguet authored
      
      
      Handling channels from CPUs in a different NUMA node can penalize
      performance, so it is better to configure only one channel per core in
      the same NUMA node as the NIC, rather than one per core in the system.

      Fall back to all other online cores if there are no online CPUs in the
      local NUMA node.
      
      Signed-off-by: Íñigo Huguet <ihuguet@redhat.com>
      Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: smc: fix different types in min() · ef739f1d
      Jakub Kicinski authored
      Fix build:
      
       include/linux/minmax.h:45:25: note: in expansion of macro ‘__careful_cmp’
         45 | #define min(x, y)       __careful_cmp(x, y, <)
            |                         ^~~~~~~~~~~~~
       net/smc/smc_tx.c:150:24: note: in expansion of macro ‘min’
        150 |         corking_size = min(sock_net(&smc->sk)->smc.sysctl_autocorking_size,
            |                        ^~~
      
      Fixes: 12bbb0d1 ("net/smc: add sysctl for autocorking")
      Link: https://lore.kernel.org/r/20220301222446.1271127-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • iavf: Remove non-inclusive language · 0a62b209
      Mateusz Palczewski authored
      
      
      Remove non-inclusive language from the iavf driver.
      
      Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
      Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • iavf: Fix incorrect use of assigning iavf_status to int · 8fc16be6
      Mateusz Palczewski authored
      
      
      Currently there are functions in iavf_virtchnl.c for polling specific
      virtchnl receive events. These are all assigning iavf_status values to
      int values. Fix this and explicitly assign int values if iavf_status
      is not IAVF_SUCCESS.
      
      Also, refactor a small amount of duplicated code that can be reused by
      all of the previously mentioned functions.
      
      Finally, fix some spacing errors for variable assignment and get rid of
      all the goto statements in the refactored functions for clarity.
      
      Signed-off-by: Brett Creeley <brett.creeley@intel.com>
      Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • iavf: stop leaking iavf_status as "errno" values · bae569d0
      Mateusz Palczewski authored
      
      
      Several functions in the iAVF core files take status values of the enum
      iavf_status and convert them into integer values. This leads to
      confusion as functions return both Linux errno values and status codes
      intermixed. Reporting status codes as if they were "errno" values can
      lead to confusion when reviewing error logs. Additionally, it can lead
      to unexpected behavior if a return value is not interpreted properly.
      
      Fix this by introducing iavf_status_to_errno, a switch that explicitly
      converts from the status codes into an appropriate error value. Also
      introduce a virtchnl_status_to_errno function for the one case where we
      were returning both virtchnl status codes and iavf_status codes in the
      same function.
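
The conversion described here is an explicit switch from one error domain into the other. A minimal sketch of the pattern follows; `enum demo_status`, its members and the chosen errno mappings are invented for illustration and do not reproduce the real enum iavf_status or the driver's actual mapping.

```c
#include <errno.h>

/* Hypothetical driver status codes, NOT the real enum iavf_status. */
enum demo_status {
	DEMO_SUCCESS = 0,
	DEMO_ERR_PARAM = -1,
	DEMO_ERR_NO_MEMORY = -2,
	DEMO_ERR_TIMEOUT = -3,
	DEMO_ERR_UNKNOWN = -4,
};

/* Convert a driver status into a negative Linux errno value so that
 * callers never see raw status codes. */
static int demo_status_to_errno(enum demo_status status)
{
	switch (status) {
	case DEMO_SUCCESS:
		return 0;
	case DEMO_ERR_PARAM:
		return -EINVAL;
	case DEMO_ERR_NO_MEMORY:
		return -ENOMEM;
	case DEMO_ERR_TIMEOUT:
		return -ETIMEDOUT;
	default:
		/* Anything without a specific mapping becomes a generic I/O error. */
		return -EIO;
	}
}
```

Keeping the mapping in one function means a status code such as -3 can no longer be mistaken for errno 3 in a log.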
      
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • iavf: remove redundant ret variable · c3fec56e
      Minghao Chi authored
      
      
      Return the value directly instead of storing it in a redundant
      variable.
      
      Reported-by: Zeal Robot <zealci@zte.com.cn>
      Signed-off-by: Minghao Chi <chi.minghao@zte.com.cn>
      Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • iavf: Add usage of new virtchnl format to set default MAC · a3e839d5
      Mateusz Palczewski authored
      
      
      Use the new type field of VIRTCHNL_OP_ADD_ETH_ADDR and
      VIRTCHNL_OP_DEL_ETH_ADDR requests to indicate that
      the VF wants to change its default MAC address.
      
      Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com>
      Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
      Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • iavf: refactor processing of VLAN V2 capability message · 87dba256
      Mateusz Palczewski authored
      
      
      In order to handle the capability exchange necessary for
      VIRTCHNL_VF_OFFLOAD_VLAN_V2, the driver must send
      a VIRTCHNL_OP_GET_OFFLOAD_VLAN_V2_CAPS message. This must occur prior to
      __IAVF_CONFIG_ADAPTER, and the driver must wait for the response from
      the PF.
      
      To handle this, the __IAVF_INIT_GET_OFFLOAD_VLAN_V2_CAPS state was
      introduced. This state is intended to process the response from the VLAN
      V2 caps message. This works, but is difficult to extend to additional
      extended capability exchanges.

      Existing (and future) AVF features rely more and more on this
      sort of extended op for processing additional capabilities. Just like
      VLAN V2, this exchange must happen prior to __IAVF_CONFIG_ADAPTER.
      
      Since we only send one outstanding AQ message at a time during init, it
      is not clear where to place this state. Adding more capability specific
      states becomes a mess. Instead of having the "previous" state send
      a message and then transition into a capability-specific state,
      introduce __IAVF_EXTENDED_CAPS state. This state will use a list of
      extended_caps that determines what messages to send and receive. As long
      as there are extended_caps bits still set, the driver will remain in
      this state performing one send or one receive per state machine loop.
      
      Refactor the VLAN V2 negotiation to use this new state, and remove the
      capability-specific state. This makes it significantly easier to add
      a new similar capability exchange going forward.
      
      Extended capabilities are processed by having an associated SEND and
      RECV extended capability bit. During __IAVF_EXTENDED_CAPS, the
      driver checks these bits in order by feature, first the send bit for
      a feature, then the recv bit for a feature. Each send flag will call
      a function that sends the necessary response, while each receive flag
      will wait for the response from the PF. If a given feature can't be
      negotiated with the PF, the associated flags will be cleared in
      order to skip processing of that feature.
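
The SEND/RECV bit handling described above can be sketched as a small userspace state machine. All names here (`struct demo_adapter`, the `DEMO_` bits, a second feature called "OTHER") are hypothetical and only model the flow; the real driver exchanges virtchnl messages with the PF instead of bumping counters.

```c
#include <stdint.h>

/* Hypothetical extended capability bits: one SEND and one RECV per feature. */
#define DEMO_SEND_VLAN_V2 (1u << 0)
#define DEMO_RECV_VLAN_V2 (1u << 1)
#define DEMO_SEND_OTHER   (1u << 2)
#define DEMO_RECV_OTHER   (1u << 3)

struct demo_adapter {
	uint32_t extended_caps; /* pending SEND/RECV bits */
	int vlan_v2_ok;         /* can VLAN V2 be negotiated with the PF? */
	int other_ok;           /* can the second feature be negotiated? */
	int sent;               /* requests sent so far */
	int received;           /* responses processed so far */
};

/* Handle at most one send or one receive for a feature.
 * Returns 1 if a message was sent or received, 0 otherwise. */
static int demo_feature_step(struct demo_adapter *a, uint32_t send,
			     uint32_t recv, int negotiable)
{
	if (!(a->extended_caps & (send | recv)))
		return 0;
	if (!negotiable) {
		/* Feature cannot be negotiated: clear its flags to skip it. */
		a->extended_caps &= ~(send | recv);
		return 0;
	}
	if (a->extended_caps & send) {
		a->extended_caps &= ~send;
		a->sent++;
	} else {
		a->extended_caps &= ~recv;
		a->received++;
	}
	return 1;
}

/* One state machine iteration. Returns 1 while bits remain set, i.e.
 * while the adapter stays in the extended caps state. */
static int demo_extended_caps_step(struct demo_adapter *a)
{
	if (!demo_feature_step(a, DEMO_SEND_VLAN_V2, DEMO_RECV_VLAN_V2,
			       a->vlan_v2_ok))
		demo_feature_step(a, DEMO_SEND_OTHER, DEMO_RECV_OTHER,
				  a->other_ok);
	return a->extended_caps != 0;
}
```

Adding another capability in this model only needs two more bits and one more `demo_feature_step()` call, which is the extensibility argument made above.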
      
      Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
      Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    • iavf: Add support for 50G/100G in AIM algorithm · d73dd127
      Mateusz Palczewski authored
      
      
      Advanced link speed support was added long ago, but AIM support for it
      was missed. This patch adds AIM support for advanced link speeds, which
      allows the algorithm to take 50G/100G link speeds into account. The
      previously supported speeds are also taken into consideration when
      advanced link speeds are supported.
      
      Signed-off-by: Brett Creeley <brett.creeley@intel.com>
      Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
  3. Mar 01, 2022
    • Merge branch 'smc-datapath-opts' · 7282c126
      David S. Miller authored
      
      
      Dust Li says:
      
      ====================
      net/smc: some datapath performance optimizations
      
      This series tries to improve the performance of SMC in the datapath.

      - patch #1 adds a sysctl interface to support tuning the behaviour of
        SMC in container environments.

      - patches #2/#3 add autocorking support, which is very efficient for
        small messages without a trade-off in latency.

      - patch #4 sends directly on setting TCP_NODELAY, without waking up the
        TX worker; this makes it consistent with clearing TCP_CORK.

      - patch #5 corrects the setting of the RMB window update limit, so
        we don't send CDC messages to update the peer's RMB window too
        frequently in some cases.

      - patch #6 implements something like NAPI in SMC, decreasing the number
        of hardirqs when busy.

      - patch #7 moves TX work done in the BH to the user context when
        sock_lock is held by the user.

      With this patchset applied, we can get a good performance gain:
      - The qperf tcp_bw test has shown a great improvement. Other benchmarks
        like 'netperf TCP_STREAM' or 'sockperf throughput' have similar results.
      - In my testing environment, running qperf tcp_bw and tcp_lat, SMC behaves
        better than TCP for almost all message sizes.
      
      Here are some test results with the following testing command:
      client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \
      		-t 30 -vu tcp_{bw|lat}
      server: smc_run taskset -c 1 qperf
      
      ==== Bandwidth ====
       MsgSize        Origin SMC              TCP                SMC with patches
             1         0.578 MB/s      2.392 MB/s(313.57%)      2.561 MB/s(342.83%)
             2         1.159 MB/s      4.780 MB/s(312.53%)      5.162 MB/s(345.46%)
             4         2.283 MB/s     10.266 MB/s(349.77%)     10.122 MB/s(343.46%)
             8         4.668 MB/s     19.040 MB/s(307.86%)     20.521 MB/s(339.59%)
            16         9.147 MB/s     38.904 MB/s(325.31%)     40.823 MB/s(346.29%)
            32        18.369 MB/s     79.587 MB/s(333.25%)     80.535 MB/s(338.42%)
            64        36.562 MB/s    148.668 MB/s(306.61%)    158.170 MB/s(332.60%)
           128        72.961 MB/s    274.913 MB/s(276.80%)    316.217 MB/s(333.41%)
           256       144.705 MB/s    512.059 MB/s(253.86%)    626.019 MB/s(332.62%)
           512       288.873 MB/s    884.977 MB/s(206.35%)   1221.596 MB/s(322.88%)
          1024       574.180 MB/s   1337.736 MB/s(132.98%)   2203.156 MB/s(283.70%)
          2048      1095.192 MB/s   1865.952 MB/s( 70.38%)   3036.448 MB/s(177.25%)
          4096      2066.157 MB/s   2380.337 MB/s( 15.21%)   3834.271 MB/s( 85.58%)
          8192      3717.198 MB/s   2733.073 MB/s(-26.47%)   4904.910 MB/s( 31.95%)
         16384      4742.221 MB/s   2958.693 MB/s(-37.61%)   5220.272 MB/s( 10.08%)
         32768      5349.550 MB/s   3061.285 MB/s(-42.77%)   5321.865 MB/s( -0.52%)
         65536      5162.919 MB/s   3731.408 MB/s(-27.73%)   5245.021 MB/s(  1.59%)
      ==== Latency ====
       MsgSize        Origin SMC              TCP                SMC with patches
             1        10.540 us     11.938 us( 13.26%)         10.356 us( -1.75%)
             2        10.996 us     11.992 us(  9.06%)         10.073 us( -8.39%)
             4        10.229 us     11.687 us( 14.25%)          9.996 us( -2.28%)
             8        10.203 us     11.653 us( 14.21%)         10.063 us( -1.37%)
            16        10.530 us     11.313 us(  7.44%)         10.013 us( -4.91%)
            32        10.241 us     11.586 us( 13.13%)         10.081 us( -1.56%)
            64        10.693 us     11.652 us(  8.97%)          9.986 us( -6.61%)
           128        10.597 us     11.579 us(  9.27%)         10.262 us( -3.16%)
           256        10.409 us     11.957 us( 14.87%)         10.148 us( -2.51%)
           512        11.088 us     12.505 us( 12.78%)         10.206 us( -7.95%)
          1024        11.240 us     12.255 us(  9.03%)         10.631 us( -5.42%)
          2048        11.485 us     16.970 us( 47.76%)         10.981 us( -4.39%)
          4096        12.077 us     13.948 us( 15.49%)         11.847 us( -1.90%)
          8192        13.683 us     16.693 us( 22.00%)         13.336 us( -2.54%)
         16384        16.470 us     23.615 us( 43.38%)         16.519 us(  0.30%)
         32768        22.540 us     40.966 us( 81.75%)         22.452 us( -0.39%)
         65536        34.192 us     73.003 us(113.51%)         33.916 us( -0.81%)
      
      ------------
      Test environment notes:
      1. Testing is run on 2 VMs within the same physical host
      2. The NIC is ConnectX-4Lx, using SRIOV, and passing through 2 VFs to the
         2 VMs respectively.
      3. To decrease jitter, each VM's vCPUs are bound to physical CPUs, and
         those physical CPUs are all isolated using boot parameter `isolcpus=xxx`
      4. The queue number is set to 1, and the interrupt from the queue is bound
         to CPU0 in the guest
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: don't send in the BH context if sock_owned_by_user · 6b88af83
      Dust Li authored
      
      
      Sending data all the way down to the RDMA device is a time
      consuming operation (get a new slot, maybe do an RDMA Write
      and send a CDC, etc). Moving those operations from BH
      to user context is good for performance.

      If the sock_lock is held by the user, we don't try to send
      data out in the BH context, but just mark that we should
      send. Since the user will release the sock_lock soon, we
      can do the sending there.

      Add smc_release_cb(), which will be called in release_sock()
      and tries to send in the callback if needed.

      This patch moves the sending part out of the BH if the sock lock
      is held by the user. In my testing environment, this saves about
      20% softirq in the qperf 4K tcp_bw test on the sender side
      with no noticeable throughput drop.
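
The deferral scheme can be modelled in a few lines of userspace C. This is only a sketch of the control flow: `struct demo_sock` and both functions are hypothetical stand-ins for the socket state, the BH path and the release callback, and the counters replace the real transmit work.

```c
/* Hypothetical socket state, for illustration only. */
struct demo_sock {
	int owned_by_user; /* stand-in for sock_owned_by_user() */
	int tx_pending;    /* set when the BH path deferred a send */
	int sends_in_bh;   /* sends performed in BH context */
	int sends_in_user; /* sends performed in user context */
};

/* Called from the (simulated) bottom half when data should be pushed. */
static void demo_tx_from_bh(struct demo_sock *sk)
{
	if (sk->owned_by_user) {
		/* Lock is held by the user: only mark that a send is needed. */
		sk->tx_pending = 1;
		return;
	}
	sk->sends_in_bh++;
}

/* Simulated lock release: the release callback performs the deferred
 * send in user context before the lock is dropped. */
static void demo_release_sock(struct demo_sock *sk)
{
	if (sk->tx_pending) {
		sk->tx_pending = 0;
		sk->sends_in_user++;
	}
	sk->owned_by_user = 0;
}
```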
      
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: don't req_notify until all CQEs drained · a505cce6
      Dust Li authored
      
      
      While we are handling softirq workload, an enabled hardirq may
      interrupt the current softirq routine and then
      try to raise the softirq again. This only wastes CPU cycles
      and brings no real gain.

      Since IB_CQ_REPORT_MISSED_EVENTS already makes sure that if
      ib_req_notify_cq() returns 0, it is safe to wait for the
      next event, there is no need to poll the CQ again in this case.

      This patch disables hardirq during the processing of softirq,
      and re-arms the CQ after softirq is done, somewhat like NAPI.
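
A toy model of the drain-then-re-arm loop, assuming a re-arm call that returns a positive value when completions were missed (the contract the message attributes to IB_CQ_REPORT_MISSED_EVENTS). `struct demo_cq` and every function below are hypothetical; no real verbs API is called.

```c
/* Hypothetical completion queue model, for illustration only. */
struct demo_cq {
	int pending;       /* completions waiting to be polled */
	int late_arrivals; /* completions that show up while re-arming */
	int armed;         /* non-zero if notifications are requested */
};

/* Poll one completion; returns 1 if one was consumed, 0 if empty. */
static int demo_poll_one(struct demo_cq *cq)
{
	if (cq->pending == 0)
		return 0;
	cq->pending--;
	return 1;
}

/* Request notification. Returns a positive value if completions are
 * already queued, meaning the caller must poll again. */
static int demo_req_notify(struct demo_cq *cq)
{
	cq->pending += cq->late_arrivals; /* simulate a race with new work */
	cq->late_arrivals = 0;
	cq->armed = 1;
	return cq->pending > 0;
}

/* Softirq-style handler: drain everything first, re-arm only afterwards,
 * and keep polling while the re-arm reports missed completions. */
static int demo_tasklet(struct demo_cq *cq)
{
	int handled = 0;

	cq->armed = 0; /* the notification that woke us has been consumed */
	do {
		while (demo_poll_one(cq))
			handled++;
	} while (demo_req_notify(cq) > 0);
	return handled;
}
```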
      
      Co-developed-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
      Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: correct settings of RMB window update limit · 6bf536eb
      Dust Li authored
      
      
      rmbe_update_limit is used to keep receive window updates
      from being announced too frequently. RFC7609 requests a minimal
      increase in the window size of 10% of the receive buffer
      space. But the current implementation used:

        min_t(int, rmbe_size / 10, SOCK_MIN_SNDBUF / 2)

      and SOCK_MIN_SNDBUF / 2 == 2304 Bytes, which is almost
      always less than 10% of the receive buffer space.

      This causes the receiver to always send a CDC message to
      update its consumer cursor when it consumes more than 2K
      of data. As a result, we may encounter something like
      "TCP silly window syndrome" when sending 2.5~8K messages.
      
      This patch fixes this using max(rmbe_size / 10, SOCK_MIN_SNDBUF / 2).
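
The effect of switching from min to max can be checked with plain integer arithmetic. The sketch below hard-codes 2304 for SOCK_MIN_SNDBUF / 2, as stated in this message; the function names and the example buffer sizes are illustrative only.

```c
/* SOCK_MIN_SNDBUF / 2 as quoted in the commit message. */
#define DEMO_MIN_SNDBUF_HALF 2304

/* Old behaviour: min(rmbe_size / 10, SOCK_MIN_SNDBUF / 2). */
static int demo_update_limit_old(int rmbe_size)
{
	int tenth = rmbe_size / 10;

	return tenth < DEMO_MIN_SNDBUF_HALF ? tenth : DEMO_MIN_SNDBUF_HALF;
}

/* New behaviour: max(rmbe_size / 10, SOCK_MIN_SNDBUF / 2). */
static int demo_update_limit_new(int rmbe_size)
{
	int tenth = rmbe_size / 10;

	return tenth > DEMO_MIN_SNDBUF_HALF ? tenth : DEMO_MIN_SNDBUF_HALF;
}
```

For a 65536-byte buffer the old limit is 2304 bytes, while the new limit is 6553 bytes, which matches the 10% threshold; for a small 16384-byte buffer the new limit stays at 2304 bytes.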
      
      With this patch and SMC autocorking enabled, qperf 2K/4K/8K
      tcp_bw test shows 45%/75%/40% increase in throughput respectively.
      
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: send directly on setting TCP_NODELAY · b70a5cc0
      Dust Li authored
      In commit ea785a1a ("net/smc: Send directly when
      TCP_CORK is cleared"), we stopped using delayed work
      to implement cork.

      This patch uses the same algorithm: it removes the
      delayed work when setting TCP_NODELAY and sends
      directly in setsockopt(). This also makes
      TCP_NODELAY behave the same as in TCP.
      
      Cc: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: add sysctl for autocorking · 12bbb0d1
      Dust Li authored
      
      
      This adds a new sysctl: net.smc.autocorking_size

      We can dynamically change the behaviour of autocorking
      by changing the value of autocorking_size.
      Setting it to 0 disables autocorking in SMC.
      
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>