Commits · 18ac597af25e9760b76471524096f5b29eb820e6 · Mirrors / github.com / sifive_riscv-linux

Nov 02, 2021

net: ndisc: introduce ndisc_evict_nocarrier sysctl parameter · 18ac597a

James Prestwood authored Nov 01, 2021



In most situations the neighbor discovery cache should be cleared on a
NOCARRIER event which is currently done unconditionally. But for wireless
roams the neighbor discovery cache can and should remain intact since
the underlying network has not changed.

This patch introduces a sysctl option ndisc_evict_nocarrier which can
be disabled by a wireless supplicant during a roam. This allows packets
to be sent after a roam immediately without having to wait for
neighbor discovery.

A user reported roughly a 1 second delay after a roam before packets
could be sent out (note, on IPv4). This delay was due to the ARP
cache being cleared. During testing of this same scenario using IPv6
no delay was noticed, but regardless there is no reason to clear
the ndisc cache for wireless roams.

Signed-off-by: James Prestwood <prestwoj@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

18ac597a

net: arp: introduce arp_evict_nocarrier sysctl parameter · fcdb44d0

James Prestwood authored Nov 01, 2021

This change introduces a new sysctl parameter, arp_evict_nocarrier.
When set (default) the ARP cache will be cleared on a NOCARRIER event.
This new option has been defaulted to '1' which maintains existing
behavior.

Clearing the ARP cache on NOCARRIER is relatively new, introduced by:

commit 859bd2ef
Author: David Ahern <dsahern@gmail.com>
Date:   Thu Oct 11 20:33:49 2018 -0700

    net: Evict neighbor entries on carrier down

The reason for this changes is to prevent the ARP cache from being
cleared when a wireless device roams. Specifically for wireless roams
the ARP cache should not be cleared because the underlying network has not
changed. Clearing the ARP cache in this case can introduce significant
delays sending out packets after a roam.

A user reported such a situation here:

https://lore.kernel.org/linux-wireless/CACsRnHWa47zpx3D1oDq9JYnZWniS8yBwW1h0WAVZ6vrbwL_S0w@mail.gmail.com/



After some investigation it was found that the kernel was holding onto
packets until ARP finished which resulted in this 1 second delay. It
was also found that the first ARP who-has was never responded to,
which is actually what caues the delay. This change is more or less
working around this behavior, but again, there is no reason to clear
the cache on a roam anyways.

As for the unanswered who-has, we know the packet made it OTA since
it was seen while monitoring. Why it never received a response is
unknown. In any case, since this is a problem on the AP side of things
all that can be done is to work around it until it is solved.

Some background on testing/reproducing the packet delay:

Hardware:
 - 2 access points configured for Fast BSS Transition (Though I don't
   see why regular reassociation wouldn't have the same behavior)
 - Wireless station running IWD as supplicant
 - A device on network able to respond to pings (I used one of the APs)

Procedure:
 - Connect to first AP
 - Ping once to establish an ARP entry
 - Start a tcpdump
 - Roam to second AP
 - Wait for operstate UP event, and note the timestamp
 - Start pinging

Results:

Below is the tcpdump after UP. It was recorded the interface went UP at
10:42:01.432875.

10:42:01.461871 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
10:42:02.497976 ARP, Request who-has 192.168.254.1 tell 192.168.254.71, length 28
10:42:02.507162 ARP, Reply 192.168.254.1 is-at ac:86:74:55:b0:20, length 46
10:42:02.507185 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 1, length 64
10:42:02.507205 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 2, length 64
10:42:02.507212 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 3, length 64
10:42:02.507219 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 4, length 64
10:42:02.507225 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 5, length 64
10:42:02.507232 IP 192.168.254.71 > 192.168.254.1: ICMP echo request, id 52792, seq 6, length 64
10:42:02.515373 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 1, length 64
10:42:02.521399 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 2, length 64
10:42:02.521612 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 3, length 64
10:42:02.521941 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 4, length 64
10:42:02.522419 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 5, length 64
10:42:02.523085 IP 192.168.254.1 > 192.168.254.71: ICMP echo reply, id 52792, seq 6, length 64

You can see the first ARP who-has went out very quickly after UP, but
was never responded to. Nearly a second later the kernel retries and
gets a response. Only then do the ping packets go out. If an ARP entry
is manually added prior to UP (after the cache is cleared) it is seen
that the first ping is never responded to, so its not only an issue with
ARP but with data packets in general.

As mentioned prior, the wireless interface was also monitored to verify
the ping/ARP packet made it OTA which was observed to be true.

Signed-off-by: James Prestwood <prestwoj@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

fcdb44d0

net: vmxnet3: remove multiple false checks in vmxnet3_ethtool.c · 1d6d336f

Jean Sacren authored Oct 30, 2021



In one if branch, (ec->rx_coalesce_usecs != 0) is checked.  When it is
checked again in two more places, it is always false and has no effect
on the whole check expression.  We should remove it in both places.

In another if branch, (ec->use_adaptive_rx_coalesce != 0) is checked.
When it is checked again, it is always false.  We should remove the
entire branch with it.

In addition we might as well let C precedence dictate by getting rid of
two pairs of parentheses in the neighboring lines in order to keep
expressions on both sides of '||' in balance with checkpatch warning
silenced.

Signed-off-by: Jean Sacren <sakiwit@gmail.com>
Link: https://lore.kernel.org/r/20211031012728.8325-1-sakiwit@gmail.com


Signed-off-by: Jakub Kicinski <kuba@kernel.org>

1d6d336f

Merge branch 'accurate-memory-charging-for-msg_zerocopy' · 8a75e30e

Jakub Kicinski authored Nov 01, 2021

Talal Ahmad says:

====================
Accurate Memory Charging For MSG_ZEROCOPY

This series improves the accuracy of msg_zerocopy memory accounting.
At present, when msg_zerocopy is used memory is charged twice for the
data - once when user space allocates it, and then again within
__zerocopy_sg_from_iter. The memory charging in the kernel is excessive
because data is held in user pages and is never actually copied to skb
fragments. This leads to incorrectly inflated memory statistics for
programs passing MSG_ZEROCOPY.

We reduce this inaccuracy by introducing the notion of "pure" zerocopy
SKBs - where all the frags in the SKB are backed by pinned userspace
pages, and none are backed by copied pages. For such SKBs, tracked via
the new SKBFL_PURE_ZEROCOPY flag, we elide sk_mem_charge/uncharge
calls, leading to more accurate accounting.

However, SKBs can also be coalesced by the stack at present,
potentially leading to "impure" SKBs. We restrict this coalescing so
it can only happen within the sendmsg() system call itself, for the
most recently allocated SKB. While this can lead to a small degree of
double-charging of memory, this case does not arise often in practice
for workloads that set MSG_ZEROCOPY.

Testing verified that memory usage in the kernel is lowered.
Instrumentation with counters also showed that accounting at time
charging and uncharging is balanced.
====================

Link: https://lore.kernel.org/r/20211030020542.3870542-1-mailtalalahmad@gmail.com


Signed-off-by: Jakub Kicinski <kuba@kernel.org>

8a75e30e

net: avoid double accounting for pure zerocopy skbs · f1a456f8

Talal Ahmad authored Oct 29, 2021



Track skbs with only zerocopy data and avoid charging them to kernel
memory to correctly account the memory utilization for msg_zerocopy.
All of the data in such skbs is held in user pages which are already
accounted to user. Before this change, they are charged again in
kernel in __zerocopy_sg_from_iter. The charging in kernel is
excessive because data is not being copied into skb frags. This
excessive charging can lead to kernel going into memory pressure
state which impacts all sockets in the system adversely. Mark pure
zerocopy skbs with a SKBFL_PURE_ZEROCOPY flag and remove
charge/uncharge for data in such skbs.

Initially, an skb is marked pure zerocopy when it is empty and in
zerocopy path. skb can then change from a pure zerocopy skb to mixed
data skb (zerocopy and copy data) if it is at tail of write queue and
there is room available in it and non-zerocopy data is being sent in
the next sendmsg call. At this time sk_mem_charge is done for the pure
zerocopied data and the pure zerocopy flag is unmarked. We found that
this happens very rarely on workloads that pass MSG_ZEROCOPY.

A pure zerocopy skb can later be coalesced into normal skb if they are
next to each other in queue but this patch prevents coalescing from
happening. This avoids complexity of charging when skb downgrades from
pure zerocopy to mixed. This is also rare.

In sk_wmem_free_skb, if it is a pure zerocopy skb, an sk_mem_uncharge
for SKB_TRUESIZE(MAX_TCP_HEADER) is done for sk_mem_charge in
tcp_skb_entail for an skb without data.

Testing with the msg_zerocopy.c benchmark between two hosts(100G nics)
with zerocopy showed that before this patch the 'sock' variable in
memory.stat for cgroup2 that tracks sum of sk_forward_alloc,
sk_rmem_alloc and sk_wmem_queued is around 1822720 and with this
change it is 0. This is due to no charge to sk_forward_alloc for
zerocopy data and shows memory utilization for kernel is lowered.

Signed-off-by: Talal Ahmad <talalahmad@google.com>
Acked-by: Arjun Roy <arjunroy@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f1a456f8

tcp: rename sk_wmem_free_skb · 03271f3a

Talal Ahmad authored Oct 29, 2021



sk_wmem_free_skb() is only used by TCP.

Rename it to make this clear, and move its declaration to
include/net/tcp.h

Signed-off-by: Talal Ahmad <talalahmad@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

03271f3a

netdevsim: fix uninit value in nsim_drv_configure_vfs() · 047304d0

Jakub Kicinski authored Nov 01, 2021



Build bot points out that I missed initializing ret
after refactoring.

Reported-by: kernel test robot <lkp@intel.com>
Fixes: 1c401078 ("netdevsim: move details of vf config to dev")
Link: https://lore.kernel.org/r/20211101221845.3188490-1-kuba@kernel.org


Signed-off-by: Jakub Kicinski <kuba@kernel.org>

047304d0

Nov 01, 2021

Merge branch 'SMC-tracepoints' · d4a07dc5

David S. Miller authored Nov 01, 2021



Tony Lu says:

====================
Tracepoints for SMC

This patch set introduces tracepoints for SMC, including the tracepoints
basic code. The tracepoitns would help us to track SMC's behaviors by
automatic tools, or other BPF tools, and zero overhead if not enabled.

Compared with kprobe and other dymatic tools, the tracepoints are
considered as stable API, and less overhead for tracing with easy-to-use
API.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

d4a07dc5

net/smc: Introduce tracepoint for smcr link down · a3a0e81b

Tony Lu authored Nov 01, 2021



SMC-R link down event is important to help us find links' issues, we
should track this event, especially in the single nic mode, which means
upper layer connection would be shut down. Then find out the direct
link-down reason in time, not only increased the counter, also the
location of the code who triggered this event.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a3a0e81b

net/smc: Introduce tracepoints for tx and rx msg · aff3083f

Tony Lu authored Nov 01, 2021



This introduce two tracepoints for smc tx and rx msg to help us
diagnosis issues of data path. These two tracepoitns don't cover the
path of CORK or MSG_MORE in tx, just the top half of data path.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

aff3083f

net/smc: Introduce tracepoint for fallback · 48262608

Tony Lu authored Nov 01, 2021



This introduces tracepoint for smc fallback to TCP, so that we can track
which connection and why it fallbacks, and map the clcsocks' pointer with
/proc/net/tcp to find more details about TCP connections. Compared with
kprobe or other dynamic tracing, tracepoints are stable and easy to use.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

48262608

Merge branch 'amt-driver' · 60088891

David S. Miller authored Nov 01, 2021

Taehee Yoo says:

====================
amt: add initial driver for Automatic Multicast Tunneling (AMT)

This is an implementation of AMT(Automatic Multicast Tunneling), RFC 7450.
https://datatracker.ietf.org/doc/html/rfc7450



This implementation supports IGMPv2, IGMPv3, MLDv1, MLDv2, and IPv4
underlay.

 Summary of RFC 7450
The purpose of this protocol is to provide multicast tunneling.
The main use-case of this protocol is to provide delivery multicast
traffic from a multicast-enabled network to sites that lack multicast
connectivity to the source network.
There are two roles in AMT protocol, Gateway, and Relay.
The main purpose of Gateway mode is to forward multicast listening
information(IGMP, MLD) to the source.
The main purpose of Relay mode is to forward multicast data to listeners.
These multicast traffics(IGMP, MLD, multicast data packets) are tunneled.

Listeners are located behind Gateway endpoint.
But gateway itself can be a listener too.
Senders are located behind Relay endpoint.

    ___________       _________       _______       ________
   |           |     |         |     |       |     |        |
   | Listeners <-----> Gateway <-----> Relay <-----> Source |
   |___________|     |_________|     |_______|     |________|
      IGMP/MLD---------(encap)----------->
         <-------------(decap)--------(encap)------Multicast Data

 Usage of AMT interface
1. Create gateway interface
ip link add amtg type amt mode gateway local 10.0.0.1 discovery 10.0.0.2 \
dev gw1_rt gateway_port 2268 relay_port 2268

2. Create Relay interface
ip link add amtr type amt mode relay local 10.0.0.2 dev relay_rt \
relay_port 2268 max_tunnels 4

v1 -> v2:
 - Eliminate sparse warnings.
   - Use bool type instead of __be16 for identifying v4/v6 protocol.

v2 -> v3:
 - Fix compile warning due to unsed variable.
 - Add missing spinlock comment.
 - Update help message of amt in Kconfig.

v3 -> v4:
 - Split patch.
 - Use CHECKSUM_NONE instead of CHECKSUM_UNNECESSARY.
 - Fix compile error.

v4 -> v5:
 - Remove unnecessary rcu_read_lock().
 - Remove unnecessary amt_change_mtu().
 - Change netlink error message.
 - Add validation for IFLA_AMT_LOCAL_IP and IFLA_AMT_DISCOVERY_IP.
 - Add comments in amt.h.
 - Add missing dev_put() in error path of amt_newlink().
 - Fix typo.
 - Add BUILD_BUG_ON() in amt_smb_cb().
 - Use macro instead of magic values.
 - Use kzalloc() instead of kmalloc().
 - Add selftest script.

v5 -> v6:
 - Reset remote_ip in amt_dev_stop().

v6 -> v7:
 - Fix compile error.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

60088891

selftests: add amt interface selftest script · c08e8bae

Taehee Yoo authored Oct 31, 2021



This is selftest script for amt interface.
This script includes basic forwarding scenarion and torture scenario.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c08e8bae

amt: add mld report message handler · b75f7095

Taehee Yoo authored Oct 31, 2021



In the previous patch, igmp report handler was added.
That handler can be used for mld too.
So, it uses that common code to parse mld report message.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b75f7095

amt: add multicast(IGMP) report message handler · bc54e49c

Taehee Yoo authored Oct 31, 2021



amt 'Relay' interface manages multicast groups(igmp/mld) and sources.
In order to manage, it should have the function to parse igmp/mld
report messages. So, this adds the logic for parsing igmp report messages
and saves them on their own data structure.

   struct amt_group_node means one group(igmp/mld).
   struct amt_source_node means one source.

The same source can't exist in the same group.
The same group can exist in the same tunnel because it manages
the host address too.

The group information is used when forwarding multicast data.
If there are no groups in the specific tunnel, Relay doesn't forward it.

Although Relay manages sources, it doesn't support the source filtering
feature. Because the reason to manage sources is just that in order
to manage group more correctly.

In the next patch, MLD part will be added.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bc54e49c

amt: add data plane of amt interface · cbc21dc1

Taehee Yoo authored Oct 31, 2021



Before forwarding multicast traffic, the amt interface establishes between
gateway and relay. In order to establish, amt defined some message type
and those message flow looks like the below.

                      Gateway                  Relay
                      -------                  -----
                         :        Request        :
                     [1] |           N           |
                         |---------------------->|
                         |    Membership Query   | [2]
                         |    N,MAC,gADDR,gPORT  |
                         |<======================|
                     [3] |   Membership Update   |
                         |   ({G:INCLUDE({S})})  |
                         |======================>|
                         |                       |
    ---------------------:-----------------------:---------------------
   |                     |                       |                     |
   |                     |    *Multicast Data    |  *IP Packet(S,G)    |
   |                     |      gADDR,gPORT      |<-----------------() |
   |    *IP Packet(S,G)  |<======================|                     |
   | ()<-----------------|                       |                     |
   |                     |                       |                     |
    ---------------------:-----------------------:---------------------
                         ~                       ~
                         ~        Request        ~
                     [4] |           N'          |
                         |---------------------->|
                         |   Membership Query    | [5]
                         | N',MAC',gADDR',gPORT' |
                         |<======================|
                     [6] |                       |
                         |       Teardown        |
                         |   N,MAC,gADDR,gPORT   |
                         |---------------------->|
                         |                       | [7]
                         |   Membership Update   |
                         |  ({G:INCLUDE({S})})   |
                         |======================>|
                         |                       |
    ---------------------:-----------------------:---------------------
   |                     |                       |                     |
   |                     |    *Multicast Data    |  *IP Packet(S,G)    |
   |                     |     gADDR',gPORT'     |<-----------------() |
   |    *IP Packet (S,G) |<======================|                     |
   | ()<-----------------|                       |                     |
   |                     |                       |                     |
    ---------------------:-----------------------:---------------------
                         |                       |
                         :                       :

1. Discovery
 - Sent by Gateway to Relay
 - To find Relay unique ip address
2. Advertisement
 - Sent by Relay to Gateway
 - Contains the unique IP address
3. Request
 - Sent by Gateway to Relay
 - Solicit to receive 'Query' message.
4. Query
 - Sent by Relay to Gateway
 - Contains General Query message.
5. Update
 - Sent by  Gateway to Relay
 - Contains report message.
6. Multicast Data
 - Sent by Relay to Gateway
 - encapsulated multicast traffic.
7. Teardown
 - Not supported at this time.

Except for the Teardown message, it supports all messages.

In the next patch, IGMP/MLD logic will be added.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cbc21dc1

amt: add control plane of amt interface · b9022b53

Taehee Yoo authored Oct 31, 2021



It adds definitions and control plane code for AMT.
this is very similar to udp tunneling interfaces such as gtp, vxlan, etc.
In the next patch, data plane code will be added.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b9022b53

Merge branch 'netdevsim-device-and-bus' · 741948ff

David S. Miller authored Nov 01, 2021



Jakub Kicinski says:

====================
netdevsim: improve separation between device and bus

VF config falls strangely in between device and bus
responsibilities today. Because of this bus.c sticks fingers
directly into struct nsim_dev and we look at nsim_bus_dev
in many more places than necessary.

Make bus.c contain pure interface code, and move
the particulars of the logic (which touch on eswitch,
devlink reloads etc) to dev.c. Rename the functions
at the boundary of the interface to make the separation
clearer.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

741948ff

netdevsim: rename 'driver' entry points · a66f64b8

Jakub Kicinski authored Oct 30, 2021



Rename functions serving as driver entry points
from nsim_dev_... to nsim_drv_... this makes the
API boundary between bus and dev clearer.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

a66f64b8

netdevsim: move max vf config to dev · a3353ec3

Jakub Kicinski authored Oct 30, 2021



max_vfs is a strange little beast because the file
hangs off of nsim's debugfs, but it configures a field
in the bus device. Move it to dev.c, let's look at it
as if the device driver was imposing VF limit based
on FW info (like pci_sriov_set_totalvfs()).

Again, when moving refactor the function not to hold
the vfs lock pointlessly while parsing the input.
Wrap the access from the read side in READ_ONCE()
to appease concurrency checkers. Do not check if
return value from snprintf() is negative...

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

a3353ec3

netdevsim: move details of vf config to dev · 1c401078

Jakub Kicinski authored Oct 30, 2021



Since "eswitch" configuration was added bus.c contains
a lot of device details which really belong to dev.c.

Restructure the code while moving it.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

1c401078

netdevsim: move vfconfig to nsim_dev · 5e388f3d

Jakub Kicinski authored Oct 30, 2021



When netdevsim got split into the faux bus vfconfig ended
up in the bus device (think pci_dev) which is strange because
it contains very networky not to say netdevy information.
Move it to nsim_dev, which is the driver "priv" structure
for the device.

To make sure we don't race with probe/remove take
the device lock (much like PCI).

While at it remove the NULL-checking of vfconfigs.
It appears to be pointless.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

5e388f3d

netdevsim: take rtnl_lock when assigning num_vfs · 26c37d89

Jakub Kicinski authored Oct 30, 2021



Legacy VF NDOs look at num_vfs and then based on that
index into vfconfig. If we don't rtnl_lock() num_vfs
may get set to 0 and vfconfig freed/replaced while
the NDO is running.

We don't need to protect replacing vfconfig since it's
only done when num_vfs is 0.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

26c37d89

Merge branch 'devlink-locking' · 1adc58ea

David S. Miller authored Nov 01, 2021

Jakub Kicinski says:

====================
improve ethtool/rtnl vs devlink locking

During ethtool netlink development we decided to move some of
the commmands to devlink. Since we don't want drivers to implement
both devlink and ethtool version of the commands ethtool ioctl
falls back to calling devlink. Unfortunately devlink locks must
be taken before rtnl_lock. This results in a questionable
dev_hold() / rtnl_unlock() / devlink / rtnl_lock() / dev_put()
pattern.

This method "works" but it working depends on drivers in question
not doing much in ethtool_ops->begin / complete, and on the netdev
not having needs_free_netdev set.

Since commit 437ebfd9

 ("devlink: Count struct devlink consumers")
we can hold a reference on a devlink instance and prevent it from
going away (sort of like netdev with dev_hold()). We can use this
to create a more natural reference nesting where we get a ref on
the devlink instance and make the devlink call entirely outside
of the rtnl_lock section.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

1adc58ea

ethtool: don't drop the rtnl_lock half way thru the ioctl · 1af0a094

Jakub Kicinski authored Oct 30, 2021



devlink compat code needs to drop rtnl_lock to take
devlink->lock to ensure correct lock ordering.

This is problematic because we're not strictly guaranteed
that the netdev will not disappear after we re-lock.
It may open a possibility of nested ->begin / ->complete
calls.

Instead of calling into devlink under rtnl_lock take
a ref on the devlink instance and make the call after
we've dropped rtnl_lock.

We (continue to) assume that netdevs have an implicit
reference on the devlink returned from ndo_get_devlink_port

Note that ndo_get_devlink_port will now get called
under rtnl_lock. That should be fine since none of
the drivers seem to be taking serious locks inside
ndo_get_devlink_port.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1af0a094

devlink: expose get/put functions · 46db1b77

Jakub Kicinski authored Oct 30, 2021



Allow those who hold implicit reference on a devlink instance
to try to take a full ref on it. This will be used from netdev
code which has an implicit ref because of driver call ordering.

Note that after recent changes devlink_unregister() may happen
before netdev unregister, but devlink_free() should still happen
after, so we are safe to try, but we can't just refcount_inc()
and assume it's not zero.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

46db1b77

ethtool: handle info/flash data copying outside rtnl_lock · 095cfcfe

Jakub Kicinski authored Oct 30, 2021



We need to increase the lifetime of the data for .get_info
and .flash_update beyond their handlers inside rtnl_lock.

Allocate a union on the heap and use it instead.

Note that we now copy the ethcmd before we lookup dev,
hopefully there is no crazy user space depending on error
codes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

095cfcfe

ethtool: push the rtnl_lock into dev_ethtool() · f49deaa6

Jakub Kicinski authored Oct 30, 2021



Don't take the lock in net/core/dev_ioctl.c,
we'll have things to do outside rtnl_lock soon.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f49deaa6

Merge branch 'mana-misc' · c6e03dbe

David S. Miller authored Nov 01, 2021



Dexuan Cui says:

====================
net: mana: some misc patches

Patch 1 is a small fix.

Patch 2 reports OS info to the PF driver.
Before the patch, the req fields were all zeros.

Patch 3 fixes and cleans up the error handling of HWC creation failure.

Patch 4 adds the callbacks for hibernation/kexec. It's based on patch 3.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

c6e03dbe

net: mana: Support hibernation and kexec · 635096a8

Dexuan Cui authored Oct 29, 2021



Implement the suspend/resume/shutdown callbacks for hibernation/kexec.

Add mana_gd_setup() and mana_gd_cleanup() for some common code, and
use them in the mand_gd_* callbacks.

Reuse mana_probe/remove() for the hibernation path.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

635096a8

net: mana: Improve the HWC error handling · 62ea8b77

Dexuan Cui authored Oct 29, 2021



Currently when the HWC creation fails, the error handling is flawed,
e.g. if mana_hwc_create_channel() -> mana_hwc_establish_channel() fails,
the resources acquired in mana_hwc_init_queues() is not released.

Enhance mana_hwc_destroy_channel() to do the proper cleanup work and
call it accordingly.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

62ea8b77

net: mana: Report OS info to the PF driver · 3c37f357

Dexuan Cui authored Oct 29, 2021



The PF driver might use the OS info for statistical purposes.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3c37f357

net: mana: Fix the netdev_err()'s vPort argument in mana_init_port() · 6c7ea696

Dexuan Cui authored Oct 29, 2021



Use the correct port index rather than 0.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6c7ea696

Merge branch 'mptcp-selftests' · 986d2e3d

David S. Miller authored Nov 01, 2021



Mat Martineau says:

====================
mptcp: Some selftest improvements

Here are a couple of selftest changes for MPTCP.

Patch 1 fixes a mistake where the wrong protocol (TCP vs MPTCP) could be
requested on the listening socket in some link failure tests.

Patch 2 refactors the simulataneous flow tests to improve timing
accuracy and give more consistent results.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

986d2e3d

selftests: mptcp: more stable simult_flows tests · b6ab64b0

Paolo Abeni authored Oct 29, 2021



Currently the simult_flows.sh self-tests are not very stable,
especially when running on slow VMs.

The tests measure runtime for transfers on multiple subflows
and check that the time is near the theoretical maximum.

The current test infra introduces a bit of jitter in test
runtime, due to multiple explicit delays. Additionally the
runtime is measured by the shell script wrapper. On a slow
VM, the script overhead is measurable and subject to relevant
jitter.

One solution to make the test more stable would be adding more
slack to the expected time; that could possibly hide real
regressions. Instead move the measurement inside the command
doing the transfer, and drop most unneeded sleeps.

Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b6ab64b0

selftests: mptcp: fix proto type in link_failure tests · 7c909a98

Geliang Tang authored Oct 29, 2021

In listener_ns, we should pass srv_proto argument to mptcp_connect command,
not cl_proto.

Fixes: 7d1e6f16

 ("selftests: mptcp: add testcase for active-back")
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

7c909a98

nfp: flower: Allow ipv6gretap interface for offloading · f7536ffb

Yu Xiao authored Oct 29, 2021



The tunnel_type check only allows for "netif_is_gretap", but for
OVS the port is actually "netif_is_ip6gretap" when setting up GRE
for ipv6, which means offloading request was rejected before.

Therefore, adding "netif_is_ip6gretap" allow ipv6gretap interface
for offloading.

Signed-off-by: Yu Xiao <yu.xiao@corigine.com>
Signed-off-by: Louis Peens <louis.peens@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f7536ffb

net: dsa: populate supported_interfaces member · c07c6e8e

Marek Behún authored Oct 28, 2021



Add a new DSA switch operation, phylink_get_interfaces, which should
fill in which PHY_INTERFACE_MODE_* are supported by given port.

Use this before phylink_create() to fill phylinks supported_interfaces
member, allowing phylink to determine which PHY_INTERFACE_MODEs are
supported.

Signed-off-by: Marek Behún <kabel@kernel.org>
[tweaked patch and description to add more complete support -- rmk]
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

c07c6e8e

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · ebed1cf5

David S. Miller authored Nov 01, 2021



Tony Nguyen says:

====================
100GbE Intel Wired LAN Driver Updates 2021-10-29

This series contains updates to ice and iavf drivers and virtchnl header
file.

Brett removes vlan_promisc argument from a function call for ice driver.
In the virtchnl header file he removes an unused, reserved define and
converts raw value defines to instead use the BIT macro.

Marcin adds syncing of MAC addresses when creating switchdev VFs to
remove error messages on link up and stops showing buffer information
for port representors to remove duplicated entries being displayed for
ice driver.

Karen introduces a helper to go from pci_dev to iavf_adapter in the
iavf driver.

Przemyslaw fixes an issue where iavf was attempting to free IRQs before
calling disable.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ebed1cf5

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next · 06f1ecd4

David S. Miller authored Nov 01, 2021



Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2021-10-30

Just two minor changes this time:

1) Remove some superfluous header files from xfrm4_tunnel.c
   From Mianhan Liu.

2) Simplify some error checks in xfrm_input().
   From luo penghao.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

06f1ecd4