  1. Nov 02, 2021
    • James Prestwood's avatar
      net: ndisc: introduce ndisc_evict_nocarrier sysctl parameter · 18ac597a
      James Prestwood authored
      
      
      In most situations the neighbor discovery cache should be cleared on a
      NOCARRIER event which is currently done unconditionally. But for wireless
      roams the neighbor discovery cache can and should remain intact since
      the underlying network has not changed.
      
      This patch introduces a sysctl option ndisc_evict_nocarrier which can
      be disabled by a wireless supplicant during a roam. This allows packets
      to be sent after a roam immediately without having to wait for
      neighbor discovery.
      
      A user reported roughly a 1 second delay after a roam before packets
      could be sent out (note, on IPv4). This delay was due to the ARP
      cache being cleared. During testing of this same scenario using IPv6
      no delay was noticed, but regardless there is no reason to clear
      the ndisc cache for wireless roams.
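
      As a rough illustration of how a supplicant might flip the knob around a
      roam, here is a minimal userspace sketch. It assumes the parameter is
      exposed per interface under /proc/sys/net/ipv6/conf/<iface>/ (and the
      IPv4 counterpart arp_evict_nocarrier under /proc/sys/net/ipv4/conf/<iface>/),
      which the commit text does not spell out. The helper names are
      hypothetical and are not taken from any supplicant.

```c
#include <stdio.h>
#include <string.h>

/* Build the sysctl path for the eviction knob of one interface.
 * family 4 selects arp_evict_nocarrier, family 6 ndisc_evict_nocarrier.
 * The /proc/sys layout is an assumption (see the text above).
 * Returns 0 on success, -1 on bad input or if buf is too small. */
int evict_sysctl_path(char *buf, size_t len, int family, const char *ifname)
{
    int n;

    if (ifname == NULL || strlen(ifname) == 0)
        return -1;
    if (family == 4)
        n = snprintf(buf, len,
                     "/proc/sys/net/ipv4/conf/%s/arp_evict_nocarrier", ifname);
    else if (family == 6)
        n = snprintf(buf, len,
                     "/proc/sys/net/ipv6/conf/%s/ndisc_evict_nocarrier", ifname);
    else
        return -1;
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "1" (evict on NOCARRIER, the default) or "0" (keep the cache).
 * Needs sufficient privileges and a kernel that carries this patch. */
int set_evict_nocarrier(int family, const char *ifname, int enable)
{
    char path[128];
    FILE *f;
    int ok;

    if (evict_sysctl_path(path, sizeof(path), family, ifname) != 0)
        return -1;
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    ok = fputs(enable ? "1" : "0", f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

      A supplicant could call set_evict_nocarrier(6, ifname, 0) before starting
      the roam and restore the value to 1 once the roam has completed, so that
      a real carrier loss still clears the cache.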
      
      Signed-off-by: James Prestwood <prestwoj@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      18ac597a
    • James Prestwood's avatar
      net: arp: introduce arp_evict_nocarrier sysctl parameter · fcdb44d0
      James Prestwood authored
      This change introduces a new sysctl parameter, arp_evict_nocarrier.
      When set (default) the ARP cache will be cleared on a NOCARRIER event.
      This new option has been defaulted to '1' which maintains existing
      behavior.
      
      Clearing the ARP cache on NOCARRIER is relatively new, introduced by:
      
      commit 859bd2ef
      Author: David Ahern <dsahern@gmail.com>
      Date:   Thu Oct 11 20:33:49 2018 -0700
      
          net: Evict neighbor entries on carrier down
      
      The reason for this change is to prevent the ARP cache from being
      cleared when a wireless device roams. Specifically for wireless roams
      the ARP cache should not be cleared because the underlying network has not
      changed. Clearing the ARP cache in this case can introduce significant
      delays sending out packets after a roam.
      
      A user reported such a situation here:
      
      https://lore.kernel.org/linux-wireless/CACsRnHWa47zpx3D1oDq9JYnZWniS8yBwW1h0WAVZ6vrbwL_S0w@mail.gmail.com/
      
      
      
      After some investigation it was found that the kernel was holding onto
      packets until ARP finished which resulted in this 1 second delay. It
      was also found that the first ARP who-has was never responded to,
      which is actually what causes the delay. This change is more or less
      working around this behavior, but again, there is no reason to clear
      the cache on a roam anyways.
      
      As for the unanswered who-has, we know the packet made it OTA since
      it was seen while monitoring. Why it never received a response is
      unknown. In any case, since this is a problem on the AP side of things
      all that can be done is to work around it until it is solved.
      
      Some background on testing/reproducing the packet delay:
      
      Hardware:
       - 2 access points configured for Fast BSS Transition (Though I don't
         see why regular reassociation wouldn't have the same behavior)
       - Wireless station running IWD as supplicant
       - A device on network able to respond to pings (I used one of the APs)
      
      Procedure:
       - Connect to first AP
       - Ping once to establish an ARP entry
       - Start a tcpdump
       - Roam to second AP
       - Wait for operstate UP event, and note the timestamp
       - Start pinging
      
      Results:
      
      Below is the tcpdump after UP. The interface was recorded going UP at
      10:42:01.432875.
      
      10:42:01.461871 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
      10:42:02.497976 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
      10:42:02.507162 ARP, Reply 192.168.254.1 is-at ac:86:74:55:b0:20, length 46
      10:42:02.507185 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 1, length 64
      10:42:02.507205 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 2, length 64
      10:42:02.507212 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 3, length 64
      10:42:02.507219 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 4, length 64
      10:42:02.507225 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 5, length 64
      10:42:02.507232 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 6, length 64
      10:42:02.515373 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 1, length 64
      10:42:02.521399 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 2, length 64
      10:42:02.521612 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 3, length 64
      10:42:02.521941 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 4, length 64
      10:42:02.522419 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 5, length 64
      10:42:02.523085 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 6, length 64
      
      You can see the first ARP who-has went out very quickly after UP, but
      was never responded to. Nearly a second later the kernel retries and
      gets a response. Only then do the ping packets go out. If an ARP entry
      is manually added prior to UP (after the cache is cleared) it is seen
      that the first ping is never responded to, so it's not only an issue with
      ARP but with data packets in general.
      
      As mentioned prior, the wireless interface was also monitored to verify
      the ping/ARP packet made it OTA which was observed to be true.
      
      Signed-off-by: James Prestwood <prestwoj@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fcdb44d0
    • Jean Sacren's avatar
      net: vmxnet3: remove multiple false checks in vmxnet3_ethtool.c · 1d6d336f
      Jean Sacren authored
      
      
      In one if branch, (ec->rx_coalesce_usecs != 0) is checked.  When it is
      checked again in two more places, it is always false and has no effect
      on the whole check expression.  We should remove it in both places.
      
      In another if branch, (ec->use_adaptive_rx_coalesce != 0) is checked.
      When it is checked again, it is always false.  We should remove the
      entire branch with it.
      
      In addition we might as well let C precedence dictate by getting rid of
      two pairs of parentheses in the neighboring lines in order to keep
      expressions on both sides of '||' in balance with checkpatch warning
      silenced.
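
      The reasoning about the always-false check can be seen in a reduced
      model. The sketch below is not the vmxnet3 code: the struct, the
      function names, and the early-return shape are assumptions chosen only
      to show why a repeated (x != 0) test drops out of an '||' expression
      once an earlier branch has handled the non-zero case.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the one ethtool coalescing field of interest. */
struct coalesce_req {
    unsigned int rx_coalesce_usecs;
};

/* Before: the second test repeats a condition that the first branch has
 * already ruled out. */
bool rejects_before(const struct coalesce_req *ec, bool other)
{
    if (ec->rx_coalesce_usecs != 0)
        return true;
    /* rx_coalesce_usecs is known to be 0 here, so the left operand of
     * '||' is always false and contributes nothing. */
    if ((ec->rx_coalesce_usecs != 0) || other)
        return true;
    return false;
}

/* After: the redundant operand is removed; behavior is unchanged. */
bool rejects_after(const struct coalesce_req *ec, bool other)
{
    if (ec->rx_coalesce_usecs != 0)
        return true;
    if (other)
        return true;
    return false;
}
```

      Both functions return the same result for every input, which is the
      property the cleanup relies on.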
      
      Signed-off-by: Jean Sacren <sakiwit@gmail.com>
      Link: https://lore.kernel.org/r/20211031012728.8325-1-sakiwit@gmail.com
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      1d6d336f
    • Jakub Kicinski's avatar
      Merge branch 'accurate-memory-charging-for-msg_zerocopy' · 8a75e30e
      Jakub Kicinski authored
      Talal Ahmad says:
      
      ====================
      Accurate Memory Charging For MSG_ZEROCOPY
      
      This series improves the accuracy of msg_zerocopy memory accounting.
      At present, when msg_zerocopy is used memory is charged twice for the
      data - once when user space allocates it, and then again within
      __zerocopy_sg_from_iter. The memory charging in the kernel is excessive
      because data is held in user pages and is never actually copied to skb
      fragments. This leads to incorrectly inflated memory statistics for
      programs passing MSG_ZEROCOPY.
      
      We reduce this inaccuracy by introducing the notion of "pure" zerocopy
      SKBs - where all the frags in the SKB are backed by pinned userspace
      pages, and none are backed by copied pages. For such SKBs, tracked via
      the new SKBFL_PURE_ZEROCOPY flag, we elide sk_mem_charge/uncharge
      calls, leading to more accurate accounting.
      
      However, SKBs can also be coalesced by the stack at present,
      potentially leading to "impure" SKBs. We restrict this coalescing so
      it can only happen within the sendmsg() system call itself, for the
      most recently allocated SKB. While this can lead to a small degree of
      double-charging of memory, this case does not arise often in practice
      for workloads that set MSG_ZEROCOPY.
      
      Testing verified that memory usage in the kernel is lowered.
      Instrumentation with counters also showed that the accounting done at
      charge and uncharge time is balanced.
      ====================
      
      Link: https://lore.kernel.org/r/20211030020542.3870542-1-mailtalalahmad@gmail.com
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8a75e30e
    • Talal Ahmad's avatar
      net: avoid double accounting for pure zerocopy skbs · f1a456f8
      Talal Ahmad authored
      
      
      Track skbs with only zerocopy data and avoid charging them to kernel
      memory to correctly account the memory utilization for msg_zerocopy.
      All of the data in such skbs is held in user pages which are already
      accounted to user. Before this change, they are charged again in
      kernel in __zerocopy_sg_from_iter. The charging in kernel is
      excessive because data is not being copied into skb frags. This
      excessive charging can lead to kernel going into memory pressure
      state which impacts all sockets in the system adversely. Mark pure
      zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
      charge/uncharge for data in such skbs.
      
      Initially, an skb is marked pure zerocopy when it is empty and in
      zerocopy path. skb can then change from a pure zerocopy skb to mixed
      data skb (zerocopy and copy data) if it is at tail of write queue and
      there is room available in it and non-zerocopy data is being sent in
      the next sendmsg call. At this time sk_mem_charge is done for the pure
      zerocopied data and the pure zerocopy flag is unmarked. We found that
      this happens very rarely on workloads that pass MSG_ZEROCOPY.
      
      A pure zerocopy skb can later be coalesced into normal skb if they are
      next to each other in queue but this patch prevents coalescing from
      happening. This avoids complexity of charging when skb downgrades from
      pure zerocopy to mixed. This is also rare.
      
      In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
      for SKB_TRUESIZE(MAX_TCP_HEADER) is done for sk_mem_charge in
      tcp_skb_entail for an skb without data.
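
      The charging rules described above can be modelled with a toy
      accounting structure. This is only a sketch under simplifying
      assumptions: the toy_* types and functions are hypothetical, a single
      counter stands in for the socket's memory accounting, and the header
      charge of SKB_TRUESIZE(MAX_TCP_HEADER) is left out entirely.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_sock {
    long charged;          /* bytes currently charged to the socket */
};

struct toy_skb {
    bool pure_zerocopy;    /* models the SKBFL_PURE_ZEROCOPY flag */
    size_t zc_bytes;       /* bytes backed by pinned user pages */
    size_t copied_bytes;   /* bytes copied into kernel memory */
};

/* Append zerocopy data. An empty skb entering the zerocopy path becomes
 * "pure" and is not charged; a mixed skb is charged as before. */
void toy_append_zerocopy(struct toy_sock *sk, struct toy_skb *skb, size_t len)
{
    if (skb->zc_bytes == 0 && skb->copied_bytes == 0)
        skb->pure_zerocopy = true;
    skb->zc_bytes += len;
    if (!skb->pure_zerocopy)
        sk->charged += (long)len;
}

/* Append copied data. A pure skb is downgraded to mixed: the zerocopy
 * bytes that were skipped earlier are charged now and the flag cleared. */
void toy_append_copied(struct toy_sock *sk, struct toy_skb *skb, size_t len)
{
    if (skb->pure_zerocopy) {
        sk->charged += (long)skb->zc_bytes;
        skb->pure_zerocopy = false;
    }
    skb->copied_bytes += len;
    sk->charged += (long)len;
}

/* Free the skb. Only data that was charged is uncharged. */
void toy_free(struct toy_sock *sk, struct toy_skb *skb)
{
    if (!skb->pure_zerocopy)
        sk->charged -= (long)(skb->zc_bytes + skb->copied_bytes);
    skb->pure_zerocopy = false;
    skb->zc_bytes = 0;
    skb->copied_bytes = 0;
}
```

      In this model a pure zerocopy skb never touches the counter, while a
      downgraded skb is charged for all of its data exactly once, so charge
      and uncharge stay balanced in both cases.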
      
      Testing with the msg_zerocopy.c benchmark between two hosts(100G nics)
      with zerocopy showed that before this patch the 'sock' variable in
      memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
      sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
      change it is 0. This is due to no charge to sk_forward_alloc for
      zerocopy data and shows memory utilization for kernel is lowered.
      
      Signed-off-by: Talal Ahmad <talalahmad@google.com>
      Acked-by: Arjun Roy <arjunroy@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f1a456f8
    • Talal Ahmad's avatar
      tcp: rename sk_wmem_free_skb · 03271f3a
      Talal Ahmad authored
      
      
      sk_wmem_free_skb() is only used by TCP.
      
      Rename it to make this clear, and move its declaration to
      include/net/tcp.h
      
      Signed-off-by: Talal Ahmad <talalahmad@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      03271f3a
    • Jakub Kicinski's avatar
      netdevsim: fix uninit value in nsim_drv_configure_vfs() · 047304d0
      Jakub Kicinski authored
      
      
      Build bot points out that I missed initializing ret
      after refactoring.
      
      Reported-by: kernel test robot <lkp@intel.com>
      Fixes: 1c401078 ("netdevsim: move details of vf config to dev")
      Link: https://lore.kernel.org/r/20211101221845.3188490-1-kuba@kernel.org
      
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      047304d0
  2. Nov 01, 2021
    • David S. Miller's avatar
      Merge branch 'SMC-tracepoints' · d4a07dc5
      David S. Miller authored
      
      
      Tony Lu says:
      
      ====================
      Tracepoints for SMC
      
      This patch set introduces tracepoints for SMC, including the basic
      tracepoint code. The tracepoints help us track SMC's behavior with
      automated tools or BPF tools, and have zero overhead when not enabled.

      Compared with kprobes and other dynamic tools, tracepoints are
      considered a stable API, and offer lower tracing overhead with an
      easy-to-use API.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d4a07dc5
    • Tony Lu's avatar
      net/smc: Introduce tracepoint for smcr link down · a3a0e81b
      Tony Lu authored
      
      
      The SMC-R link-down event is important for diagnosing link issues, so we
      should track it, especially in single-NIC mode, where a link going down
      means the upper-layer connection is shut down. The tracepoint lets us
      find the direct link-down reason promptly: not only that the counter
      increased, but also the location in the code that triggered the event.
      
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a3a0e81b
    • Tony Lu's avatar
      net/smc: Introduce tracepoints for tx and rx msg · aff3083f
      Tony Lu authored
      
      
      This introduces two tracepoints for smc tx and rx msg to help us
      diagnose issues in the data path. These two tracepoints don't cover the
      path of CORK or MSG_MORE in tx, just the top half of the data path.
      
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aff3083f
    • Tony Lu's avatar
      net/smc: Introduce tracepoint for fallback · 48262608
      Tony Lu authored
      
      
      This introduces a tracepoint for smc fallback to TCP, so that we can track
      which connection falls back and why, and map the clcsocks' pointers to
      /proc/net/tcp to find more details about the TCP connections. Compared with
      kprobe or other dynamic tracing, tracepoints are stable and easy to use.
      
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      48262608
    • David S. Miller's avatar
      Merge branch 'amt-driver' · 60088891
      David S. Miller authored
      Taehee Yoo says:
      
      ====================
      amt: add initial driver for Automatic Multicast Tunneling (AMT)
      
      This is an implementation of AMT(Automatic Multicast Tunneling), RFC 7450.
      https://datatracker.ietf.org/doc/html/rfc7450
      
      
      
      This implementation supports IGMPv2, IGMPv3, MLDv1, MLDv2, and IPv4
      underlay.
      
       Summary of RFC 7450
      The purpose of this protocol is to provide multicast tunneling.
      The main use case is to deliver multicast traffic from a
      multicast-enabled network to sites that lack multicast
      connectivity to the source network.
      There are two roles in the AMT protocol: Gateway and Relay.
      The main purpose of Gateway mode is to forward multicast listening
      information (IGMP, MLD) to the source.
      The main purpose of Relay mode is to forward multicast data to listeners.
      All of this multicast traffic (IGMP, MLD, multicast data packets) is tunneled.
      
      Listeners are located behind Gateway endpoint.
      But gateway itself can be a listener too.
      Senders are located behind Relay endpoint.
      
          ___________       _________       _______       ________
         |           |     |         |     |       |     |        |
         | Listeners <-----> Gateway <-----> Relay <-----> Source |
         |___________|     |_________|     |_______|     |________|
            IGMP/MLD---------(encap)----------->
               <-------------(decap)--------(encap)------Multicast Data
      
       Usage of AMT interface
      1. Create gateway interface
      ip link add amtg type amt mode gateway local 10.0.0.1 discovery 10.0.0.2 \
      dev gw1_rt gateway_port 2268 relay_port 2268
      
      2. Create Relay interface
      ip link add amtr type amt mode relay local 10.0.0.2 dev relay_rt \
      relay_port 2268 max_tunnels 4
      
      v1 -> v2:
       - Eliminate sparse warnings.
         - Use bool type instead of __be16 for identifying v4/v6 protocol.
      
      v2 -> v3:
       - Fix compile warning due to unused variable.
       - Add missing spinlock comment.
       - Update help message of amt in Kconfig.
      
      v3 -> v4:
       - Split patch.
       - Use CHECKSUM_NONE instead of CHECKSUM_UNNECESSARY.
       - Fix compile error.
      
      v4 -> v5:
       - Remove unnecessary rcu_read_lock().
       - Remove unnecessary amt_change_mtu().
       - Change netlink error message.
       - Add validation for IFLA_AMT_LOCAL_IP and IFLA_AMT_DISCOVERY_IP.
       - Add comments in amt.h.
       - Add missing dev_put() in error path of amt_newlink().
       - Fix typo.
       - Add BUILD_BUG_ON() in amt_smb_cb().
       - Use macro instead of magic values.
       - Use kzalloc() instead of kmalloc().
       - Add selftest script.
      
      v5 -> v6:
       - Reset remote_ip in amt_dev_stop().
      
      v6 -> v7:
       - Fix compile error.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
      60088891
    • Taehee Yoo's avatar
      selftests: add amt interface selftest script · c08e8bae
      Taehee Yoo authored
      
      
      This is a selftest script for the amt interface.
      The script includes a basic forwarding scenario and a torture scenario.
      
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c08e8bae
    • Taehee Yoo's avatar
      amt: add mld report message handler · b75f7095
      Taehee Yoo authored
      
      
      In the previous patch, igmp report handler was added.
      That handler can be used for mld too.
      So, it uses that common code to parse mld report message.
      
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b75f7095
    • Taehee Yoo's avatar
      amt: add multicast(IGMP) report message handler · bc54e49c
      Taehee Yoo authored
      
      
      amt 'Relay' interface manages multicast groups(igmp/mld) and sources.
      In order to manage them, it needs to parse igmp/mld report messages.
      So, this adds the logic for parsing igmp report messages and saving
      them in their own data structures.
      
         struct amt_group_node means one group(igmp/mld).
         struct amt_source_node means one source.
      
      The same source can't exist in the same group.
      The same group can exist in the same tunnel because it manages
      the host address too.
      
      The group information is used when forwarding multicast data.
      If there are no groups in the specific tunnel, Relay doesn't forward it.
      
      Although Relay manages sources, it doesn't support the source filtering
      feature. Sources are tracked only in order to manage groups more
      correctly.
      
      In the next patch, MLD part will be added.
      
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bc54e49c
    • Taehee Yoo's avatar
      amt: add data plane of amt interface · cbc21dc1
      Taehee Yoo authored
      
      
      Before forwarding multicast traffic, the amt interface establishes a
      tunnel between gateway and relay. To do so, amt defines several message
      types, and the message flow looks like the one below.
      
                            Gateway                  Relay
                            -------                  -----
                               :        Request        :
                           [1] |           N           |
                               |---------------------->|
                               |    Membership Query   | [2]
                               |    N,MAC,gADDR,gPORT  |
                               |<======================|
                           [3] |   Membership Update   |
                               |   ({G:INCLUDE({S})})  |
                               |======================>|
                               |                       |
          ---------------------:-----------------------:---------------------
         |                     |                       |                     |
         |                     |    *Multicast Data    |  *IP Packet(S,G)    |
         |                     |      gADDR,gPORT      |<-----------------() |
         |    *IP Packet(S,G)  |<======================|                     |
         | ()<-----------------|                       |                     |
         |                     |                       |                     |
          ---------------------:-----------------------:---------------------
                               ~                       ~
                               ~        Request        ~
                           [4] |           N'          |
                               |---------------------->|
                               |   Membership Query    | [5]
                               | N',MAC',gADDR',gPORT' |
                               |<======================|
                           [6] |                       |
                               |       Teardown        |
                               |   N,MAC,gADDR,gPORT   |
                               |---------------------->|
                               |                       | [7]
                               |   Membership Update   |
                               |  ({G:INCLUDE({S})})   |
                               |======================>|
                               |                       |
          ---------------------:-----------------------:---------------------
         |                     |                       |                     |
         |                     |    *Multicast Data    |  *IP Packet(S,G)    |
         |                     |     gADDR',gPORT'     |<-----------------() |
         |    *IP Packet (S,G) |<======================|                     |
         | ()<-----------------|                       |                     |
         |                     |                       |                     |
          ---------------------:-----------------------:---------------------
                               |                       |
                               :                       :
      
      1. Discovery
       - Sent by Gateway to Relay
       - To find Relay unique ip address
      2. Advertisement
       - Sent by Relay to Gateway
       - Contains the unique IP address
      3. Request
       - Sent by Gateway to Relay
       - Solicit to receive 'Query' message.
      4. Query
       - Sent by Relay to Gateway
       - Contains General Query message.
      5. Update
       - Sent by Gateway to Relay
       - Contains report message.
      6. Multicast Data
       - Sent by Relay to Gateway
       - encapsulated multicast traffic.
      7. Teardown
       - Not supported at this time.
      
      Except for the Teardown message, it supports all messages.
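
      The message list above can be summarized as a small lookup. This is a
      sketch only: the enum and function names are hypothetical rather than
      the driver's identifiers, and the numeric values simply follow the
      numbering of the list in this commit message.

```c
#include <stdbool.h>

/* Hypothetical names; values follow the numbered list above. */
enum toy_amt_msg {
    TOY_AMT_DISCOVERY = 1,
    TOY_AMT_ADVERTISEMENT,
    TOY_AMT_REQUEST,
    TOY_AMT_QUERY,
    TOY_AMT_UPDATE,
    TOY_AMT_MULTICAST_DATA,
    TOY_AMT_TEARDOWN,
};

enum toy_amt_role {
    TOY_AMT_GATEWAY,
    TOY_AMT_RELAY,
};

/* Store which side sends the message and return true if the message is
 * supported. Teardown and unknown types return false. */
bool toy_amt_msg_sender(enum toy_amt_msg type, enum toy_amt_role *sender)
{
    switch (type) {
    case TOY_AMT_DISCOVERY:
    case TOY_AMT_REQUEST:
    case TOY_AMT_UPDATE:
        *sender = TOY_AMT_GATEWAY;
        return true;
    case TOY_AMT_ADVERTISEMENT:
    case TOY_AMT_QUERY:
    case TOY_AMT_MULTICAST_DATA:
        *sender = TOY_AMT_RELAY;
        return true;
    case TOY_AMT_TEARDOWN:
    default:
        return false;
    }
}
```

      A receive path could use such a lookup to drop messages that arrive
      from the wrong role, for example an Update received by a gateway.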
      
      In the next patch, IGMP/MLD logic will be added.
      
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cbc21dc1
    • Taehee Yoo's avatar
      amt: add control plane of amt interface · b9022b53
      Taehee Yoo authored
      
      
      This adds definitions and control plane code for AMT.
      It is very similar to udp tunneling interfaces such as gtp, vxlan, etc.
      In the next patch, data plane code will be added.
      
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b9022b53
    • David S. Miller's avatar
      Merge branch 'netdevsim-device-and-bus' · 741948ff
      David S. Miller authored
      
      
      Jakub Kicinski says:
      
      ====================
      netdevsim: improve separation between device and bus
      
      VF config falls strangely in between device and bus
      responsibilities today. Because of this bus.c sticks fingers
      directly into struct nsim_dev and we look at nsim_bus_dev
      in many more places than necessary.
      
      Make bus.c contain pure interface code, and move
      the particulars of the logic (which touch on eswitch,
      devlink reloads etc) to dev.c. Rename the functions
      at the boundary of the interface to make the separation
      clearer.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
      741948ff
    • Jakub Kicinski's avatar
      netdevsim: rename 'driver' entry points · a66f64b8
      Jakub Kicinski authored
      
      
      Rename functions serving as driver entry points
      from nsim_dev_... to nsim_drv_... This makes the
      API boundary between bus and dev clearer.
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a66f64b8
    • Jakub Kicinski's avatar
      netdevsim: move max vf config to dev · a3353ec3
      Jakub Kicinski authored
      
      
      max_vfs is a strange little beast because the file
      hangs off of nsim's debugfs, but it configures a field
      in the bus device. Move it to dev.c, let's look at it
      as if the device driver was imposing VF limit based
      on FW info (like pci_sriov_set_totalvfs()).
      
      Again, when moving refactor the function not to hold
      the vfs lock pointlessly while parsing the input.
      Wrap the access from the read side in READ_ONCE()
      to appease concurrency checkers. Do not check if
      return value from snprintf() is negative...
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a3353ec3
    • Jakub Kicinski's avatar
      netdevsim: move details of vf config to dev · 1c401078
      Jakub Kicinski authored
      
      
      Since "eswitch" configuration was added bus.c contains
      a lot of device details which really belong to dev.c.
      
      Restructure the code while moving it.
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1c401078
    • Jakub Kicinski's avatar
      netdevsim: move vfconfig to nsim_dev · 5e388f3d
      Jakub Kicinski authored
      
      
      When netdevsim got split into the faux bus vfconfig ended
      up in the bus device (think pci_dev) which is strange because
      it contains very networky not to say netdevy information.
      Move it to nsim_dev, which is the driver "priv" structure
      for the device.
      
      To make sure we don't race with probe/remove take
      the device lock (much like PCI).
      
      While at it remove the NULL-checking of vfconfigs.
      It appears to be pointless.
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5e388f3d
    • Jakub Kicinski's avatar
      netdevsim: take rtnl_lock when assigning num_vfs · 26c37d89
      Jakub Kicinski authored
      
      
      Legacy VF NDOs look at num_vfs and then based on that
      index into vfconfig. If we don't rtnl_lock() num_vfs
      may get set to 0 and vfconfig freed/replaced while
      the NDO is running.
      
      We don't need to protect replacing vfconfig since it's
      only done when num_vfs is 0.
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      26c37d89
    • David S. Miller's avatar
      Merge branch 'devlink-locking' · 1adc58ea
      David S. Miller authored
      Jakub Kicinski says:
      
      ====================
      improve ethtool/rtnl vs devlink locking
      
      During ethtool netlink development we decided to move some of
      the commands to devlink. Since we don't want drivers to implement
      both devlink and ethtool version of the commands ethtool ioctl
      falls back to calling devlink. Unfortunately devlink locks must
      be taken before rtnl_lock. This results in a questionable
      dev_hold() / rtnl_unlock() / devlink / rtnl_lock() / dev_put()
      pattern.
      
      This method "works" but its working depends on drivers in question
      not doing much in ethtool_ops->begin / complete, and on the netdev
      not having needs_free_netdev set.
      
      Since commit 437ebfd9 ("devlink: Count struct devlink consumers")
      we can hold a reference on a devlink instance and prevent it from
      going away (sort of like netdev with dev_hold()). We can use this
      to create a more natural reference nesting where we get a ref on
      the devlink instance and make the devlink call entirely outside
      of the rtnl_lock section.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1adc58ea
    • Jakub Kicinski's avatar
      ethtool: don't drop the rtnl_lock half way thru the ioctl · 1af0a094
      Jakub Kicinski authored
      
      
      devlink compat code needs to drop rtnl_lock to take
      devlink->lock to ensure correct lock ordering.
      
      This is problematic because we're not strictly guaranteed
      that the netdev will not disappear after we re-lock.
      It may open a possibility of nested ->begin / ->complete
      calls.
      
      Instead of calling into devlink under rtnl_lock take
      a ref on the devlink instance and make the call after
      we've dropped rtnl_lock.
      
      We (continue to) assume that netdevs have an implicit
      reference on the devlink returned from ndo_get_devlink_port
      
      Note that ndo_get_devlink_port will now get called
      under rtnl_lock. That should be fine since none of
      the drivers seem to be taking serious locks inside
      ndo_get_devlink_port.
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1af0a094
    • Jakub Kicinski's avatar
      devlink: expose get/put functions · 46db1b77
      Jakub Kicinski authored
      
      
      Allow those who hold implicit reference on a devlink instance
      to try to take a full ref on it. This will be used from netdev
      code which has an implicit ref because of driver call ordering.
      
      Note that after recent changes devlink_unregister() may happen
      before netdev unregister, but devlink_free() should still happen
      after, so we are safe to try, but we can't just refcount_inc()
      and assume it's not zero.
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      46db1b77
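The "try to take a full ref" idea described above can be sketched in user-space C. This is only an illustration under stated assumptions: `struct obj`, `obj_try_get()` and `obj_put()` are hypothetical names, and C11 atomics stand in for the kernel's refcount helpers. The point it shows is why a plain increment is not enough: once the count has reached zero the object is already being torn down, so the increment must be refused.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object (not the devlink structure). */
struct obj {
    atomic_uint refcnt;
};

/* Take a reference only if the count is still non-zero. */
static bool obj_try_get(struct obj *o)
{
    unsigned int old = atomic_load(&o->refcnt);

    while (old != 0) {
        /* On failure, old is refreshed with the current count and we retry. */
        if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
            return true;
    }
    return false; /* count already hit zero: object is going away */
}

/* Drop a reference; returns true if the caller dropped the last one. */
static bool obj_put(struct obj *o)
{
    return atomic_fetch_sub(&o->refcnt, 1) == 1;
}
```

A caller that only holds an implicit reference would call `obj_try_get()` first and skip the operation if it returns false.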
    • Jakub Kicinski's avatar
      ethtool: handle info/flash data copying outside rtnl_lock · 095cfcfe
      Jakub Kicinski authored
      
      
      We need to increase the lifetime of the data for .get_info
      and .flash_update beyond their handlers inside rtnl_lock.
      
      Allocate a union on the heap and use it instead.
      
      Note that we now copy the ethcmd before we look up dev,
      hopefully there is no crazy user space depending on error
      codes.
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      095cfcfe
    • Jakub Kicinski's avatar
      ethtool: push the rtnl_lock into dev_ethtool() · f49deaa6
      Jakub Kicinski authored
      
      
      Don't take the lock in net/core/dev_ioctl.c,
      we'll have things to do outside rtnl_lock soon.
      
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f49deaa6
    • David S. Miller's avatar
      Merge branch 'mana-misc' · c6e03dbe
      David S. Miller authored
      
      
      Dexuan Cui says:
      
      ====================
      net: mana: some misc patches
      
      Patch 1 is a small fix.
      
      Patch 2 reports OS info to the PF driver.
      Before the patch, the req fields were all zeros.
      
      Patch 3 fixes and cleans up the error handling of HWC creation failure.
      
      Patch 4 adds the callbacks for hibernation/kexec. It's based on patch 3.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c6e03dbe
    • Dexuan Cui's avatar
      net: mana: Support hibernation and kexec · 635096a8
      Dexuan Cui authored
      
      
      Implement the suspend/resume/shutdown callbacks for hibernation/kexec.
      
      Add mana_gd_setup() and mana_gd_cleanup() for some common code, and
      use them in the mana_gd_* callbacks.
      
      Reuse mana_probe/remove() for the hibernation path.
      
      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      635096a8
    • Dexuan Cui's avatar
      net: mana: Improve the HWC error handling · 62ea8b77
      Dexuan Cui authored
      
      
      Currently when the HWC creation fails, the error handling is flawed,
      e.g. if mana_hwc_create_channel() -> mana_hwc_establish_channel() fails,
      the resources acquired in mana_hwc_init_queues() are not released.
      
      Enhance mana_hwc_destroy_channel() to do the proper cleanup work and
      call it accordingly.
      
      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      62ea8b77
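The cleanup approach described above, where a single destroy routine handles a partially created channel and is called from the create error path, can be sketched as follows. This is a simplified user-space illustration, not the mana driver code: `struct channel`, `channel_create()` and `channel_destroy()` are hypothetical, and the `fail_establish` flag merely simulates a failure in the establish step.

```c
#include <stdlib.h>

/* Hypothetical channel with two queues acquired during init. */
struct channel {
    void *rx_queue;
    void *tx_queue;
    int established;
};

/* Safe on a partially initialised channel: free(NULL) is a no-op. */
static void channel_destroy(struct channel *ch)
{
    if (!ch)
        return;
    free(ch->tx_queue);
    free(ch->rx_queue);
    free(ch);
}

/* Create a channel; every failure path funnels into channel_destroy(). */
static struct channel *channel_create(int fail_establish)
{
    struct channel *ch = calloc(1, sizeof(*ch));

    if (!ch)
        return NULL;

    ch->rx_queue = malloc(64);
    ch->tx_queue = malloc(64);
    if (!ch->rx_queue || !ch->tx_queue)
        goto err;

    if (fail_establish) /* simulated failure of the establish step */
        goto err;

    ch->established = 1;
    return ch;

err:
    channel_destroy(ch); /* releases whatever was acquired so far */
    return NULL;
}
```

Because the destroy routine tolerates missing pieces, the create path needs only one error label instead of one per acquired resource.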
    • Dexuan Cui's avatar
      net: mana: Report OS info to the PF driver · 3c37f357
      Dexuan Cui authored
      
      
      The PF driver might use the OS info for statistical purposes.
      
      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3c37f357
    • Dexuan Cui's avatar
      net: mana: Fix the netdev_err()'s vPort argument in mana_init_port() · 6c7ea696
      Dexuan Cui authored
      
      
      Use the correct port index rather than 0.
      
      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6c7ea696
    • David S. Miller's avatar
      Merge branch 'mptcp-selftests' · 986d2e3d
      David S. Miller authored
      
      
      Mat Martineau says:
      
      ====================
      mptcp: Some selftest improvements
      
      Here are a couple of selftest changes for MPTCP.
      
      Patch 1 fixes a mistake where the wrong protocol (TCP vs MPTCP) could be
      requested on the listening socket in some link failure tests.
      
      Patch 2 refactors the simultaneous flow tests to improve timing
      accuracy and give more consistent results.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
      986d2e3d
    • Paolo Abeni's avatar
      selftests: mptcp: more stable simult_flows tests · b6ab64b0
      Paolo Abeni authored
      
      
      Currently the simult_flows.sh self-tests are not very stable,
      especially when running on slow VMs.
      
      The tests measure runtime for transfers on multiple subflows
      and check that the time is near the theoretical maximum.
      
      The current test infra introduces a bit of jitter in test
      runtime, due to multiple explicit delays. Additionally the
      runtime is measured by the shell script wrapper. On a slow
      VM, the script overhead is measurable and subject to significant
      jitter.
      
      One solution to make the test more stable would be adding more
      slack to the expected time; that could possibly hide real
      regressions. Instead move the measurement inside the command
      doing the transfer, and drop most unneeded sleeps.
      
      Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b6ab64b0
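Moving the measurement inside the command doing the transfer can be sketched like this. It is a minimal illustration, not the selftest's actual code: `timed_transfer()`, `elapsed_ms()` and `fake_transfer()` are hypothetical names, and the monotonic clock is an assumption chosen so that wall-clock adjustments do not affect the result.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Milliseconds between two timestamps, truncated toward zero. */
static int64_t elapsed_ms(const struct timespec *start,
                          const struct timespec *end)
{
    int64_t ns = ((int64_t)end->tv_sec - (int64_t)start->tv_sec) * 1000000000LL
               + ((int64_t)end->tv_nsec - (int64_t)start->tv_nsec);

    return ns / 1000000;
}

/* Time only the transfer itself, excluding process and script start-up. */
static int64_t timed_transfer(void (*transfer)(void *), void *arg)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    transfer(arg);
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_ms(&start, &end);
}

/* Hypothetical stand-in for the real data transfer. */
static void fake_transfer(void *arg)
{
    int *count = arg;

    (*count)++;
}
```

Timing in the process that moves the data keeps shell start-up and explicit sleeps out of the measured interval, which is where the jitter described above comes from.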
    • Geliang Tang's avatar
      selftests: mptcp: fix proto type in link_failure tests · 7c909a98
      Geliang Tang authored
      In listener_ns, we should pass the srv_proto argument to the
      mptcp_connect command, not cl_proto.
      
      Fixes: 7d1e6f16 ("selftests: mptcp: add testcase for active-back")
      Signed-off-by: Geliang Tang <geliang.tang@suse.com>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c909a98
    • Yu Xiao's avatar
      nfp: flower: Allow ipv6gretap interface for offloading · f7536ffb
      Yu Xiao authored
      
      
      The tunnel_type check only allows for "netif_is_gretap", but for
      OVS the port is actually "netif_is_ip6gretap" when setting up GRE
      for ipv6, which means the offloading request was rejected before.
      
      Therefore, add a "netif_is_ip6gretap" check to allow ipv6gretap
      interfaces to be offloaded.
      
      Signed-off-by: Yu Xiao <yu.xiao@corigine.com>
      Signed-off-by: Louis Peens <louis.peens@corigine.com>
      Signed-off-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f7536ffb
    • Marek Behún's avatar
      net: dsa: populate supported_interfaces member · c07c6e8e
      Marek Behún authored
      
      
      Add a new DSA switch operation, phylink_get_interfaces, which should
      fill in which PHY_INTERFACE_MODE_* are supported by a given port.
      
      Use this before phylink_create() to fill phylink's supported_interfaces
      member, allowing phylink to determine which PHY_INTERFACE_MODEs are
      supported.
      
      Signed-off-by: Marek Behún <kabel@kernel.org>
      [tweaked patch and description to add more complete support -- rmk]
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c07c6e8e
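The shape of such an operation, a per-port callback that fills a bitmap of supported interface modes before the link object is created, might look like the sketch below. Everything here is hypothetical and self-contained: the `enum iface_mode` values, the bitmap helpers and `example_get_interfaces()` are illustrative stand-ins, not the DSA or phylink API.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical interface modes; the kernel uses PHY_INTERFACE_MODE_*. */
enum iface_mode {
    MODE_SGMII,
    MODE_1000BASEX,
    MODE_2500BASEX,
    MODE_RGMII,
    MODE_MAX
};

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MODE_WORDS ((MODE_MAX + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Mark one mode as supported in the bitmap. */
static void set_mode(unsigned long *map, enum iface_mode m)
{
    map[m / BITS_PER_WORD] |= 1UL << (m % BITS_PER_WORD);
}

/* Query whether a mode is marked as supported. */
static bool test_mode(const unsigned long *map, enum iface_mode m)
{
    return (map[m / BITS_PER_WORD] >> (m % BITS_PER_WORD)) & 1UL;
}

/* Hypothetical driver callback: report what a given port can do. */
static void example_get_interfaces(int port, unsigned long *supported)
{
    set_mode(supported, MODE_SGMII);
    set_mode(supported, MODE_1000BASEX);
    if (port == 0) /* assume only port 0 has the faster SERDES */
        set_mode(supported, MODE_2500BASEX);
}
```

The caller would zero the bitmap, invoke the callback for the port, and hand the result to the link layer, which can then reject modes whose bit is not set.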
    • David S. Miller's avatar
      Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · ebed1cf5
      David S. Miller authored
      
      
      Tony Nguyen says:
      
      ====================
      100GbE Intel Wired LAN Driver Updates 2021-10-29
      
      This series contains updates to ice and iavf drivers and virtchnl header
      file.
      
      Brett removes vlan_promisc argument from a function call for ice driver.
      In the virtchnl header file he removes an unused, reserved define and
      converts raw value defines to instead use the BIT macro.
      
      Marcin adds syncing of MAC addresses when creating switchdev VFs to
      remove error messages on link up and stops showing buffer information
      for port representors to remove duplicated entries being displayed for
      ice driver.
      
      Karen introduces a helper to go from pci_dev to iavf_adapter in the
      iavf driver.
      
      Przemyslaw fixes an issue where iavf was attempting to free IRQs before
      calling disable.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ebed1cf5
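The conversion of raw value defines to the BIT macro mentioned in the series description can be illustrated with a small before/after sketch. The flag names are hypothetical and are not taken from the virtchnl header; `BIT()` is defined locally here in the same spirit as the kernel helper.

```c
/* A single bit at position n, in the spirit of the kernel's BIT() helper. */
#define BIT(n) (1UL << (n))

/* Before: raw hexadecimal values (hypothetical capability flags). */
#define EXAMPLE_CAP_A_RAW 0x00000001
#define EXAMPLE_CAP_B_RAW 0x00000002
#define EXAMPLE_CAP_C_RAW 0x00000008
#define EXAMPLE_CAP_D_RAW 0x00010000

/* After: the bit position is explicit, so gaps and overlaps are easy to spot. */
#define EXAMPLE_CAP_A BIT(0)
#define EXAMPLE_CAP_B BIT(1)
#define EXAMPLE_CAP_C BIT(3)
#define EXAMPLE_CAP_D BIT(16)
```

Both forms expand to the same values; the second makes it obvious, for instance, that bit 2 is unused in this made-up set.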
    • David S. Miller's avatar
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next · 06f1ecd4
      David S. Miller authored
      
      
      Steffen Klassert says:
      
      ====================
      pull request (net-next): ipsec-next 2021-10-30
      
      Just two minor changes this time:
      
      1) Remove some superfluous header files from xfrm4_tunnel.c
         From Mianhan Liu.
      
      2) Simplify some error checks in xfrm_input().
         From luo penghao.
      
      Please pull or let me know if there are problems.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
      06f1ecd4