  1. Mar 22, 2022
    • net: dsa: fix panic on shutdown if multi-chip tree failed to probe · 8fd36358
      Vladimir Oltean authored
      DSA probing is atypical because a tree of devices must probe all at
      once, so out of N switches which call dsa_tree_setup_routing_table()
      during probe, for (N - 1) of them, "complete" will return false and they
      will exit probing early. The Nth switch will set up the whole tree on
      their behalf.
      
      The implication is that for (N - 1) switches, the driver binds to the
      device successfully, without doing anything. When the driver is bound,
      the ->shutdown() method may run. But if the Nth switch has failed to
      initialize the tree, there is nothing to do for the (N - 1) driver
      instances, since the slave devices have not been created, etc. Moreover,
      dsa_switch_shutdown() expects that the calling @ds has in fact been
      initialized, so it goes straight to dereferencing various data
      structures, which is incorrect.
      
      Avoid the ensuing NULL pointer dereferences by simply checking whether
      the Nth switch has previously set "ds->setup = true" for the switch
      which is currently shutting down. The entire setup is serialized under
      dsa2_mutex which we already hold.
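
The guard described above can be sketched in userspace C. This is a minimal model, not the kernel code: `struct model_switch` and `model_switch_shutdown()` are hypothetical names, and the locking that the commit relies on (dsa2_mutex) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for a DSA switch instance. */
struct model_switch {
    bool setup;        /* set by the switch that completes the tree */
    int *ports;        /* only valid once setup is true */
    size_t num_ports;
};

/* Returns the number of ports torn down; 0 if there was nothing to do. */
size_t model_switch_shutdown(struct model_switch *ds)
{
    size_t i, torn_down = 0;

    assert(ds != NULL);

    /* Bound but never set up: the data below was never created. */
    if (!ds->setup)
        return 0;

    for (i = 0; i < ds->num_ports; i++) {
        ds->ports[i] = 0;
        torn_down++;
    }
    ds->setup = false;
    return torn_down;
}
```

A switch that bound to its device but never had its tree set up returns early without touching `ports`, which is the point of the check.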
      
      Fixes: 0650bf52 ("net: dsa: be compatible with masters which unregister on shutdown")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20220318195443.275026-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • openvswitch: always update flow key after nat · 60b44ca6
      Aaron Conole authored
      During NAT, a tuple collision may occur.  When this happens, openvswitch
      will make a second pass through NAT which will perform additional packet
      modification.  This will update the skb data, but not the flow key that
      OVS uses.  This means that future flow lookups and packet matches will
      have incorrect data.  This has been supported since
      5d50aa83 ("openvswitch: support asymmetric conntrack").
      
      That commit failed to properly update the sw_flow_key attributes, since
      it only called ovs_ct_nat_update_key() once, rather than each time
      ovs_ct_nat_execute() was called.  As these two operations are linked, the
      ovs_ct_nat_execute() function should always make sure that the
      sw_flow_key is updated after a successful call through NAT infrastructure.
      
      Fixes: 5d50aa83 ("openvswitch: support asymmetric conntrack")
      Cc: Dumitru Ceara <dceara@redhat.com>
      Cc: Numan Siddique <nusiddiq@redhat.com>
      Signed-off-by: Aaron Conole <aconole@redhat.com>
      Acked-by: Eelco Chaudron <echaudro@redhat.com>
      Link: https://lore.kernel.org/r/20220318124319.3056455-1-aconole@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: ensure PMTU updates are processed during fastopen · ed0c99dc
      Jakub Kicinski authored
      tp->rx_opt.mss_clamp is not yet populated during TFO send, so we
      raise it to the local MSS. tp->mss_cache is not updated, however:
      
      tcp_v6_connect():
        tp->rx_opt.mss_clamp = IPV6_MIN_MTU - headers;
        tcp_connect():
           tcp_connect_init():
             tp->mss_cache = min(mtu, tp->rx_opt.mss_clamp)
           tcp_send_syn_data():
             tp->rx_opt.mss_clamp = tp->advmss
      
      After recent fixes to ICMPv6 PTB handling we started dropping
      PMTU updates higher than tp->mss_cache. Because of the stale
      tp->mss_cache value PMTU updates during TFO are always dropped.
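
The effect of a stale cached MSS can be illustrated with a small model. The sketch assumes a fixed 40-byte IPv6 header and a 20-byte TCP header with no options, and it treats an update as accepted only when it would lower the MSS below the cached value; the function names are hypothetical and the exact comparison in the kernel may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_IPV6_HDR 40u   /* fixed IPv6 header, no extension headers */
#define MODEL_TCP_HDR  20u   /* TCP header without options */

/* Simplified MTU-to-MSS conversion (assumption: no options). */
uint32_t model_mtu_to_mss(uint32_t mtu)
{
    assert(mtu > MODEL_IPV6_HDR + MODEL_TCP_HDR);
    return mtu - MODEL_IPV6_HDR - MODEL_TCP_HDR;
}

/* A Packet Too Big update is only honoured if it lowers the cached MSS. */
bool model_pmtu_update_accepted(uint32_t reported_mtu, uint32_t mss_cache)
{
    return model_mtu_to_mss(reported_mtu) < mss_cache;
}
```

With a cache left at the IPv6 minimum MTU of 1280 (MSS 1220 in this model), a report of MTU 1400 is rejected; with a cache refreshed for a 1500-byte MTU (MSS 1440), the same report is accepted.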
      
      Thanks to Wei for helping zero in on the problem and the fix!
      
      Fixes: c7bb4b89 ("ipv6: tcp: drop silly ICMPv6 packet too big messages")
      Reported-by: Andre Nash <alnash@fb.com>
      Reported-by: Neil Spring <ntspring@fb.com>
      Reviewed-by: Wei Wang <weiwan@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20220321165957.1769954-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: bcmgenet: Use stronger register read/writes to assure ordering · 8d3ea3d4
      Jeremy Linton authored
      GCC12 appears to be much smarter about its dependency tracking and is
      aware that the relaxed variants are just normal loads and stores and
      this is causing problems like:
      
      [  210.074549] ------------[ cut here ]------------
      [  210.079223] NETDEV WATCHDOG: enabcm6e4ei0 (bcmgenet): transmit queue 1 timed out
      [  210.086717] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:529 dev_watchdog+0x234/0x240
      [  210.095044] Modules linked in: genet(E) nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat]
      [  210.146561] ACPI CPPC: PCC check channel failed for ss: 0. ret=-110
      [  210.146927] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G            E     5.17.0-rc7G12+ #58
      [  210.153226] CPPC Cpufreq:cppc_scale_freq_workfn: failed to read perf counters
      [  210.161349] Hardware name: Raspberry Pi Foundation Raspberry Pi 4 Model B/Raspberry Pi 4 Model B, BIOS EDK2-DEV 02/08/2022
      [  210.161353] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      [  210.161358] pc : dev_watchdog+0x234/0x240
      [  210.161364] lr : dev_watchdog+0x234/0x240
      [  210.161368] sp : ffff8000080a3a40
      [  210.161370] x29: ffff8000080a3a40 x28: ffffcd425af87000 x27: ffff8000080a3b20
      [  210.205150] x26: ffffcd425aa00000 x25: 0000000000000001 x24: ffffcd425af8ec08
      [  210.212321] x23: 0000000000000100 x22: ffffcd425af87000 x21: ffff55b142688000
      [  210.219491] x20: 0000000000000001 x19: ffff55b1426884c8 x18: ffffffffffffffff
      [  210.226661] x17: 64656d6974203120 x16: 0000000000000001 x15: 6d736e617274203a
      [  210.233831] x14: 2974656e65676d63 x13: ffffcd4259c300d8 x12: ffffcd425b07d5f0
      [  210.241001] x11: 00000000ffffffff x10: ffffcd425b07d5f0 x9 : ffffcd4258bdad9c
      [  210.248171] x8 : 00000000ffffdfff x7 : 000000000000003f x6 : 0000000000000000
      [  210.255341] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000001000
      [  210.262511] x2 : 0000000000001000 x1 : 0000000000000005 x0 : 0000000000000044
      [  210.269682] Call trace:
      [  210.272133]  dev_watchdog+0x234/0x240
      [  210.275811]  call_timer_fn+0x3c/0x15c
      [  210.279489]  __run_timers.part.0+0x288/0x310
      [  210.283777]  run_timer_softirq+0x48/0x80
      [  210.287716]  __do_softirq+0x128/0x360
      [  210.291392]  __irq_exit_rcu+0x138/0x140
      [  210.295243]  irq_exit_rcu+0x1c/0x30
      [  210.298745]  el1_interrupt+0x38/0x54
      [  210.302334]  el1h_64_irq_handler+0x18/0x24
      [  210.306445]  el1h_64_irq+0x7c/0x80
      [  210.309857]  arch_cpu_idle+0x18/0x2c
      [  210.313445]  default_idle_call+0x4c/0x140
      [  210.317470]  cpuidle_idle_call+0x14c/0x1a0
      [  210.321584]  do_idle+0xb0/0x100
      [  210.324737]  cpu_startup_entry+0x30/0x8c
      [  210.328675]  secondary_start_kernel+0xe4/0x110
      [  210.333138]  __secondary_switched+0x94/0x98
      
      The assumption when these were relaxed seems to be that device memory
      would be mapped as non-reordering, and that other constructs
      (spinlocks/etc) would provide the barriers to assure that packet data
      and in-memory rings/queues were ordered with respect to device
      register reads/writes. This itself seems a bit sketchy, but the real
      problem with GCC12 is that it is moving the actual reads/writes around
      at will as though they were independent operations when in truth they
      are not, but the compiler can't know that. When looking at the
      assembly dumps for many of these routines, it's possible to see very
      clean, but not strictly in-program-order, operations occurring, as the
      compiler would be free to do if these weren't actually register
      read/write operations.
      
      It's possible to suppress the timeout with a liberal sprinkling of
      dma_mb()s, but the device still seems unable to reliably send/receive
      data. A better plan is to use the safer readl/writel everywhere.
      
      This partially reverts an older commit, which notes the use of the
      relaxed variants for performance reasons. I would suggest that any
      performance problems with this commit be addressed by relaxing only
      the performance-critical code paths after assuring proper barriers.
      
      Fixes: 69d2ea9c ("net: bcmgenet: Use correct I/O accessors")
      Reported-by: Peter Robinson <pbrobinson@gmail.com>
      Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
      Acked-by: Peter Robinson <pbrobinson@gmail.com>
      Tested-by: Peter Robinson <pbrobinson@gmail.com>
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/20220310045358.224350-1-jeremy.linton@arm.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. Mar 21, 2022
    • Merge branch 'ax25-fixes' · ed32641e
      David S. Miller authored
      
      
      Duoming Zhou says:
      
      ====================
      Fix refcount leak and NPD bugs in ax25
      
      The first patch fixes refcount leak in ax25 that could cause
      ax25-ex-connected-session-now-listening-state-bug.
      
      The second patch fixes NPD bugs in ax25 timers.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ax25: Fix NULL pointer dereferences in ax25 timers · fc6d01ff
      Duoming Zhou authored
      The previous commit 7ec02f5a ("ax25: fix NPD bug in ax25_disconnect")
      moved ax25_disconnect() into lock_sock() in order to prevent NPD bugs. But
      there are race conditions that may lead to NULL pointer dereferences in
      ax25_heartbeat_expiry(), ax25_t1timer_expiry(), ax25_t2timer_expiry(),
      ax25_t3timer_expiry() and ax25_idletimer_expiry(), when we use
      ax25_kill_by_device() to detach the ax25 device.
      
      One of the race conditions that causes a NULL pointer dereference is
      shown below:
      
            (Thread 1)                    |      (Thread 2)
      ax25_connect()                      |
       ax25_std_establish_data_link()     |
        ax25_start_t1timer()              |
         mod_timer(&ax25->t1timer,..)     |
                                          | ax25_kill_by_device()
         (wait a time)                    |  ...
                                          |  s->ax25_dev = NULL; //(1)
         ax25_t1timer_expiry()            |
          ax25->ax25_dev->values[..] //(2)|  ...
           ...                            |
      
      We set ax25_cb->ax25_dev to NULL at position (1) and dereference
      the NULL pointer at position (2).

      The corresponding failure log is shown below:
      ===============================================================
      BUG: kernel NULL pointer dereference, address: 0000000000000050
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.17.0-rc6-00794-g45690b7d0
      RIP: 0010:ax25_t1timer_expiry+0x12/0x40
      ...
      Call Trace:
       call_timer_fn+0x21/0x120
       __run_timers.part.0+0x1ca/0x250
       run_timer_softirq+0x2c/0x60
       __do_softirq+0xef/0x2f3
       irq_exit_rcu+0xb6/0x100
       sysvec_apic_timer_interrupt+0xa2/0xd0
      ...
      
      This patch moves ax25_disconnect() before s->ax25_dev = NULL
      and uses del_timer_sync() to delete timers in ax25_disconnect().
      If ax25_disconnect() is called by ax25_kill_by_device() or
      ax25->ax25_dev is NULL, the reason passed to ax25_disconnect() will
      be ENETUNREACH, and it will wait for all timers to stop before we
      set s->ax25_dev to NULL in ax25_kill_by_device().
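
The ordering (stop the timers first, clear the pointer second) can be modelled in a few lines. This is a single-threaded sketch with hypothetical names; a boolean stands in for a pending kernel timer and clearing it stands in for del_timer_sync().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_dev { int t1_value; };

struct model_cb {
    struct model_dev *dev;
    bool t1_pending;   /* models a timer armed with mod_timer() */
};

/* Timer handler: runs only if still pending; returns -1 once cancelled. */
int model_t1_expiry(struct model_cb *cb)
{
    if (!cb->t1_pending)
        return -1;
    cb->t1_pending = false;
    return cb->dev->t1_value;   /* safe: dev is cleared only after cancel */
}

/* Teardown in the order the commit describes: stop timers, then clear. */
void model_kill_by_device(struct model_cb *cb)
{
    assert(cb != NULL);
    cb->t1_pending = false;     /* stands in for del_timer_sync() */
    cb->dev = NULL;
}
```

Once teardown has run, a late expiry is a no-op instead of a dereference of the cleared pointer.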
      
      Fixes: 7ec02f5a ("ax25: fix NPD bug in ax25_disconnect")
      Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ax25: Fix refcount leaks caused by ax25_cb_del() · 9fd75b66
      Duoming Zhou authored
      The previous commit d01ffb9e ("ax25: add refcount in ax25_dev to
      avoid UAF bugs") and commit feef318c ("ax25: fix UAF bugs of
      net_device caused by rebinding operation") increase the refcounts of
      ax25_dev and net_device in ax25_bind() and decrease the matching refcounts
      in ax25_kill_by_device() in order to prevent UAF bugs, but there are
      reference count leaks.
      
      The root cause of refcount leaks is shown below:
      
           (Thread 1)                      |      (Thread 2)
      ax25_bind()                          |
       ...                                 |
       ax25_addr_ax25dev()                 |
        ax25_dev_hold()   //(1)            |
        ...                                |
       dev_hold_track()   //(2)            |
       ...                                 | ax25_destroy_socket()
                                           |  ax25_cb_del()
                                           |   ...
                                           |   hlist_del_init() //(3)
                                           |
                                           |
           (Thread 3)                      |
      ax25_kill_by_device()                |
       ...                                 |
       ax25_for_each(s, &ax25_list) {      |
        if (s->ax25_dev == ax25_dev) //(4) |
         ...                               |
      
      Firstly, we use ax25_bind() to increase the refcount of ax25_dev in
      position (1) and increase the refcount of net_device in position (2).
      Then, we use ax25_cb_del() invoked by ax25_destroy_socket() to delete
      ax25_cb from the hlist in position (3) before calling ax25_kill_by_device().
      Finally, the decrements of refcounts in ax25_kill_by_device() will not
      be executed, because no s->ax25_dev equals ax25_dev in position (4).

      This patch adds decrements of refcounts in ax25_release() and uses
      lock_sock() for synchronization. If the refcounts are decreased in
      ax25_release(), the decrements in ax25_kill_by_device() will not be
      executed, and vice versa.
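
The "either path drops the reference, but only once" rule can be sketched as below. The types and function names are hypothetical, and the lock_sock() serialization that the commit relies on is omitted, so this model is only valid single-threaded.

```c
#include <assert.h>
#include <stddef.h>

struct model_dev  { int refcnt; };
struct model_sock { struct model_dev *dev; };

/* Bind takes one reference on the device. */
void model_bind(struct model_sock *s, struct model_dev *d)
{
    assert(s != NULL && d != NULL);
    d->refcnt++;
    s->dev = d;
}

/* Shared by both teardown paths; drops the reference at most once. */
int model_put_dev(struct model_sock *s)
{
    if (s->dev == NULL)
        return 0;
    s->dev->refcnt--;
    s->dev = NULL;
    return 1;
}

/* Socket release path. */
int model_release(struct model_sock *s)
{
    return model_put_dev(s);
}

/* Device removal path: only acts on sockets still bound to this device. */
int model_kill_by_device(struct model_sock *s, struct model_dev *d)
{
    if (s->dev != d)
        return 0;
    return model_put_dev(s);
}
```

Whichever path runs first clears `s->dev`, so the other path finds nothing left to drop and the count returns to zero without a leak or a double put.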
      
      Fixes: d01ffb9e ("ax25: add refcount in ax25_dev to avoid UAF bugs")
      Fixes: 87563a04 ("ax25: fix reference count leaks of ax25_dev")
      Fixes: feef318c ("ax25: fix UAF bugs of net_device caused by rebinding operation")
      Reported-by: Thomas Osterried <thomas@osterried.de>
      Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. Mar 19, 2022
    • af_netlink: Fix shift out of bounds in group mask calculation · 0caf6d99
      Petr Machata authored
      When a netlink message is received, netlink_recvmsg() fills in the address
      of the sender. One of the fields is the 32-bit bitfield nl_groups, which
      carries the multicast group on which the message was received. The least
      significant bit corresponds to group 1, and therefore the highest group
      that the field can represent is 32. Above that, the UB sanitizer flags the
      out-of-bounds shift attempts.
      
      Which bits end up being set in such a case is implementation defined, but
      it's either going to be a wrong non-zero value, or zero, which is at least
      not misleading. Make the latter choice deterministic by always setting to 0
      for higher-numbered multicast groups.
      
      To get information about membership in groups >= 32, userspace is expected
      to use nl_pktinfo control messages[0], which are enabled by the
      NETLINK_PKTINFO socket option.
      [0] https://lwn.net/Articles/147608/
      
      The way to trigger this issue is e.g. through monitoring the BRVLAN group:
      
      	# bridge monitor vlan &
      	# ip link add name br type bridge
      
      Which produces the following report:
      
      	UBSAN: shift-out-of-bounds in net/netlink/af_netlink.c:162:19
      	shift exponent 32 is too large for 32-bit type 'int'
      
      Fixes: f7fa9b10 ("[NETLINK]: Support dynamic number of multicast groups per netlink family")
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Link: https://lore.kernel.org/r/2bef6aabf201d1fc16cca139a744700cff9dcb04.1647527635.git.petrm@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mptcp: Fix crash due to tcp_tsorted_anchor was initialized before release skb · 3ef3905a
      Yonglong Li authored
      A crash was hit during an MPTCP stress test:
      
      ===========================================================================
      dst_release: dst:ffffa06ce6e5c058 refcnt:-1
      kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
      BUG: unable to handle kernel paging request at ffffa06ce6e5c058
      PGD 190a01067 P4D 190a01067 PUD 43fffb067 PMD 22e403063 PTE 8000000226e5c063
      Oops: 0011 [#1] SMP PTI
      CPU: 7 PID: 7823 Comm: kworker/7:0 Kdump: loaded Tainted: G            E
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.2.1 04/01/2014
      Call Trace:
       ? skb_release_head_state+0x68/0x100
       ? skb_release_all+0xe/0x30
       ? kfree_skb+0x32/0xa0
       ? mptcp_sendmsg_frag+0x57e/0x750
       ? __mptcp_retrans+0x21b/0x3c0
       ? __switch_to_asm+0x35/0x70
       ? mptcp_worker+0x25e/0x320
       ? process_one_work+0x1a7/0x360
       ? worker_thread+0x30/0x390
       ? create_worker+0x1a0/0x1a0
       ? kthread+0x112/0x130
       ? kthread_flush_work_fn+0x10/0x10
       ? ret_from_fork+0x35/0x40
      ===========================================================================
      
      In __mptcp_alloc_tx_skb() the skb is allocated and skb->tcp_tsorted_anchor
      is initialized. Under memory pressure, sk_wmem_schedule() returns false
      and the skb is released with kfree_skb(). In this case skb->_skb_refdst is
      not NULL, because _skb_refdst and tcp_tsorted_anchor are stored in the same
      memory, so kfree_skb() tries to release the dst and crashes.
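
The aliasing at the heart of this crash can be shown with a small union. This sketch only demonstrates the shared storage; it does not reproduce the actual patch, and `struct model_skb` with its helpers is a hypothetical cut-down stand-in for the real sk_buff layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct model_list_head { struct model_list_head *next, *prev; };

/* The two members share storage, as the commit text describes. */
struct model_skb {
    union {
        uintptr_t refdst;                       /* models skb->_skb_refdst */
        struct model_list_head tsorted_anchor;  /* models tcp_tsorted_anchor */
    };
};

/* A freshly allocated skb starts out zeroed in this model. */
void model_skb_alloc(struct model_skb *skb)
{
    memset(skb, 0, sizeof(*skb));
}

/* Equivalent of initialising an empty list head: both links point to self. */
void model_init_anchor(struct model_skb *skb)
{
    skb->tsorted_anchor.next = &skb->tsorted_anchor;
    skb->tsorted_anchor.prev = &skb->tsorted_anchor;
}

/* Returns 1 if a free path would try to release a (bogus) dst. */
int model_free_would_release_dst(const struct model_skb *skb)
{
    return skb->refdst != 0;
}

/* Clear the aliased word before handing the skb to the free path. */
void model_prepare_for_free(struct model_skb *skb)
{
    assert(skb != NULL);
    skb->refdst = 0;
}
```

After `model_init_anchor()` the aliased word holds a list pointer, so a free path that inspects it sees what looks like a dst reference. Avoiding that means either not initialising the anchor before a possible free, or clearing the word first.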
      
      Fixes: f70cad10 ("mptcp: stop relying on tcp_tx_skb_cache")
      Reviewed-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Yonglong Li <liyonglong@chinatelecom.cn>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Link: https://lore.kernel.org/r/20220317220953.426024-1-mathew.j.martineau@linux.intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'ipv4-handle-tos-and-scope-properly-for-icmp-redirects-and-pmtu-updates' · 03e2777c
      Jakub Kicinski authored
      Guillaume Nault says:
      
      ====================
      ipv4: Handle TOS and scope properly for ICMP redirects and PMTU updates
      
      ICMPv4 PMTU and redirect handlers didn't properly initialise the
      struct flowi4 they used for route lookups:
      
        * ECN bits sometimes weren't cleared from ->flowi4_tos.
        * The RTO_ONLINK flag wasn't taken into account for ->flowi4_scope.
      
      In some special cases, this resulted in ICMP redirects and PMTU updates
      not being taken into account because fib_lookup() couldn't retrieve the
      correct route.
      ====================
      
      Link: https://lore.kernel.org/r/cover.1647519748.git.gnault@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftest: net: Test IPv4 PMTU exceptions with DSCP and ECN · ec730c3e
      Guillaume Nault authored
      
      
      Add two tests to pmtu.sh, for verifying that PMTU exceptions get
      properly created for routes that don't belong to the main table.
      
      A fib-rule based on the packet's DSCP field is used to jump to the
      correct table. ECN shouldn't interfere with this process, so each test
      has two components: one that only sets DSCP and one that sets both DSCP
      and ECN.
      
      One of the tests triggers PMTU exceptions using ICMP Echo Requests, the
      other using UDP packets (to test different handlers in the kernel).
      
      A few adjustments are necessary in the rest of the script to allow
      policy routing scenarios:
      
        * Add global variable rt_table that allows setup_routing_*() to
          add routes to a specific routing table. By default rt_table is set
          to "main", so existing tests don't need to be modified.
      
        * Another global variable, policy_mark, is used to define which
          dsfield value is used for policy routing. This variable has no
          effect on tests that don't use policy routing.
      
        * The UDP version of the test uses socat, so cleanup() now also needs
          to kill socat PIDs.
      
        * route_get_dst_pmtu_from_exception() and route_get_dst_exception()
          now take an optional third argument specifying the dsfield. If
          not specified, 0 is used, so existing users don't need to be
          modified.
      
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ipv4: Fix route lookups when handling ICMP redirects and PMTU updates · 544b4dd5
      Guillaume Nault authored
      The PMTU update and ICMP redirect helper functions initialise their fl4
      variable with either __build_flow_key() or build_sk_flow_key(). These
      initialisation functions always set ->flowi4_scope with
      RT_SCOPE_UNIVERSE and might set the ECN bits of ->flowi4_tos. This is
      not a problem when the route lookup is later done via
      ip_route_output_key_hash(), which properly clears the ECN bits from
      ->flowi4_tos and initialises ->flowi4_scope based on the RTO_ONLINK
      flag. However, some helpers call fib_lookup() directly, without
      sanitising the tos and scope fields, so the route lookup can fail and,
      as a result, the ICMP redirect or PMTU update isn't taken into
      account.
      
      Fix this by extracting the ->flowi4_tos and ->flowi4_scope sanitisation
      code into ip_rt_fix_tos(), then use this function in handlers that call
      fib_lookup() directly.
      
      Note 1: We can't sanitise ->flowi4_tos and ->flowi4_scope in a central
      place (like __build_flow_key() or flowi4_init_output()), because
      ip_route_output_key_hash() expects non-sanitised values. When called
      with sanitised values, it can erroneously overwrite RT_SCOPE_LINK with
      RT_SCOPE_UNIVERSE in ->flowi4_scope. Therefore we have to be careful to
      sanitise the values only for those paths that don't call
      ip_route_output_key_hash().
      
      Note 2: The problem is mostly about sanitising ->flowi4_tos. Having
      ->flowi4_scope initialised with RT_SCOPE_UNIVERSE instead of
      RT_SCOPE_LINK probably wasn't really a problem: sockets with the
      SOCK_LOCALROUTE flag set (those that'd result in RTO_ONLINK being set)
      normally shouldn't receive ICMP redirects or PMTU updates.
      
      Fixes: 4895c771 ("ipv4: Add FIB nexthop exceptions.")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf · 6bd0c76b
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      pull-request: bpf 2022-03-18
      
      We've added 2 non-merge commits during the last 18 day(s) which contain
      a total of 2 files changed, 50 insertions(+), 20 deletions(-).
      
      The main changes are:
      
      1) Fix a race in XSK socket teardown code that can lead to a NULL pointer
         dereference, from Magnus.
      
      2) Small MAINTAINERS doc update to remove Lorenz from sockmap, from Lorenz.
      
      * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
        xsk: Fix race at socket teardown
        bpf: Remove Lorenz Bauer from L7 BPF maintainers
      ====================
      
      Link: https://lore.kernel.org/r/20220318152418.28638-1-daniel@iogearbox.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  4. Mar 18, 2022
    • Merge branch 'af_unix-OOB-fixes' · 9905eed4
      David S. Miller authored
      Kuniyuki Iwashima says:
      
      ====================
      af_unix: Fix some OOB implementation.
      
      This series fixes some data-races and adds a missing feature around the
      commit 314001f0 ("af_unix: Add OOB support").
      
      Changelog:
        - v3:
          - Add the first patch
      
        - v2: https://lore.kernel.org/netdev/20220315054801.72035-1-kuniyu@amazon.co.jp/
          - Add READ_ONCE() to avoid a race reported by KCSAN (Eric)
          - Add IS_ENABLED(CONFIG_AF_UNIX_OOB) (Shoaib)
      
        - v1: https://lore.kernel.org/netdev/20220314052110.53634-1-kuniyu@amazon.co.jp/
      
      
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • af_unix: Support POLLPRI for OOB. · d9a232d4
      Kuniyuki Iwashima authored
      The commit 314001f0 ("af_unix: Add OOB support") introduced OOB for
      AF_UNIX, but it lacks some changes for POLLPRI.  Let's add the missing
      piece.
      
      In the selftest, normal datagrams are sent followed by OOB data, so this
      commit replaces `POLLIN | POLLPRI` with just `POLLPRI` in the first test
      case.
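
What "supporting POLLPRI" means for a poll handler can be modelled roughly as below. This is a hypothetical userspace model, not the kernel's poll implementation: `struct model_unix_sock` and `model_unix_poll_mask()` are invented for illustration, and only the POLLIN and POLLPRI constants from <poll.h> are real.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/* Hypothetical socket state: queued ordinary data and a pending OOB buffer. */
struct model_unix_sock {
    int queued;            /* non-zero if ordinary data is readable */
    const void *oob_skb;   /* non-NULL while out-of-band data is pending */
};

/* Compute the readiness mask a poll handler would report in this model. */
int model_unix_poll_mask(const struct model_unix_sock *u)
{
    int mask = 0;

    assert(u != NULL);
    if (u->queued)
        mask |= POLLIN;
    if (u->oob_skb != NULL)
        mask |= POLLPRI;   /* the piece that was missing for OOB */
    return mask;
}
```

A caller interested only in out-of-band data can then wait on POLLPRI alone.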
      
      Fixes: 314001f0 ("af_unix: Add OOB support")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • af_unix: Fix some data-races around unix_sk(sk)->oob_skb. · e82025c6
      Kuniyuki Iwashima authored
      Out-of-band data automatically places a "mark" showing where in the
      sequence the out-of-band data would have been.  If the out-of-band data
      implies cancelling everything sent so far, the "mark" is helpful to flush
      them.  When the socket's read pointer reaches the "mark", the SIOCATMARK
      ioctl() sets the arg `atmark` to a non-zero value.
      
      The out-of-band data is queued in sk->sk_receive_queue as well as ordinary
      data and also saved in unix_sk(sk)->oob_skb.  It can be used to test if the
      head of the receive queue is the out-of-band data meaning the socket is at
      the "mark".
      
      While testing that, unix_ioctl() reads unix_sk(sk)->oob_skb locklessly.
      Thus, all accesses to oob_skb need some basic protection to avoid
      load/store tearing which KCSAN detects when these are called concurrently:
      
        - ioctl(fd_a, SIOCATMARK, &atmark, sizeof(atmark))
        - send(fd_b_connected_to_a, buf, sizeof(buf), MSG_OOB)
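
The "basic protection" for a lockless read is typically a pair of once-only accessors. The macros below are simplified userspace versions written for GCC or Clang (they rely on `__typeof__`), not the kernel definitions, and the struct and functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified once-only accessors; the kernel's macros are more elaborate. */
#define MODEL_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define MODEL_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

struct model_unix_sock {
    const void *queue_head;   /* first buffer in the receive queue */
    const void *oob_skb;      /* buffer holding the out-of-band byte */
};

/* Writer side: publish the OOB buffer with a single, untorn store. */
void model_queue_oob(struct model_unix_sock *u, const void *skb)
{
    assert(u != NULL);
    MODEL_WRITE_ONCE(u->oob_skb, skb);
}

/* Lockless "at mark" test: read oob_skb exactly once, then compare. */
int model_at_mark(struct model_unix_sock *u)
{
    const void *oob = MODEL_READ_ONCE(u->oob_skb);

    return oob != NULL && oob == u->queue_head;
}
```

Reading the pointer once into a local prevents the compiler from reloading it between the NULL check and the comparison, which is the tearing and refetch hazard the report is about.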
      
      BUG: KCSAN: data-race in unix_ioctl / unix_stream_sendmsg
      
      write to 0xffff888003d9cff0 of 8 bytes by task 175 on cpu 1:
       unix_stream_sendmsg (net/unix/af_unix.c:2087 net/unix/af_unix.c:2191)
       sock_sendmsg (net/socket.c:705 net/socket.c:725)
       __sys_sendto (net/socket.c:2040)
       __x64_sys_sendto (net/socket.c:2048)
       do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
       entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
      
      read to 0xffff888003d9cff0 of 8 bytes by task 176 on cpu 0:
       unix_ioctl (net/unix/af_unix.c:3101 (discriminator 1))
       sock_do_ioctl (net/socket.c:1128)
       sock_ioctl (net/socket.c:1242)
       __x64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:874 fs/ioctl.c:860 fs/ioctl.c:860)
       do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
       entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
      
      value changed: 0xffff888003da0c00 -> 0xffff888003da0d00
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 176 Comm: unix_race_oob_i Not tainted 5.17.0-rc5-59529-g83dc4c2af682 #12
      Hardware name: Red Hat KVM, BIOS 1.11.0-2.amzn2 04/01/2014
      
      Fixes: 314001f0 ("af_unix: Add OOB support")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ibmvnic: fix race between xmit and reset · 4219196d
      Sukadev Bhattiprolu authored
      There is a race between reset and the transmit paths that can lead to
      ibmvnic_xmit() accessing an scrq after it has been freed in the reset
      path. It can result in a crash like:
      
      	Kernel attempted to read user page (0) - exploit attempt? (uid: 0)
      	BUG: Kernel NULL pointer dereference on read at 0x00000000
      	Faulting instruction address: 0xc0080000016189f8
      	Oops: Kernel access of bad area, sig: 11 [#1]
      	...
      	NIP [c0080000016189f8] ibmvnic_xmit+0x60/0xb60 [ibmvnic]
      	LR [c000000000c0046c] dev_hard_start_xmit+0x11c/0x280
      	Call Trace:
      	[c008000001618f08] ibmvnic_xmit+0x570/0xb60 [ibmvnic] (unreliable)
      	[c000000000c0046c] dev_hard_start_xmit+0x11c/0x280
      	[c000000000c9cfcc] sch_direct_xmit+0xec/0x330
      	[c000000000bfe640] __dev_xmit_skb+0x3a0/0x9d0
      	[c000000000c00ad4] __dev_queue_xmit+0x394/0x730
      	[c008000002db813c] __bond_start_xmit+0x254/0x450 [bonding]
      	[c008000002db8378] bond_start_xmit+0x40/0xc0 [bonding]
      	[c000000000c0046c] dev_hard_start_xmit+0x11c/0x280
      	[c000000000c00ca4] __dev_queue_xmit+0x564/0x730
      	[c000000000cf97e0] neigh_hh_output+0xd0/0x180
      	[c000000000cfa69c] ip_finish_output2+0x31c/0x5c0
      	[c000000000cfd244] __ip_queue_xmit+0x194/0x4f0
      	[c000000000d2a3c4] __tcp_transmit_skb+0x434/0x9b0
      	[c000000000d2d1e0] __tcp_retransmit_skb+0x1d0/0x6a0
      	[c000000000d2d984] tcp_retransmit_skb+0x34/0x130
      	[c000000000d310e8] tcp_retransmit_timer+0x388/0x6d0
      	[c000000000d315ec] tcp_write_timer_handler+0x1bc/0x330
      	[c000000000d317bc] tcp_write_timer+0x5c/0x200
      	[c000000000243270] call_timer_fn+0x50/0x1c0
      	[c000000000243704] __run_timers.part.0+0x324/0x460
      	[c000000000243894] run_timer_softirq+0x54/0xa0
      	[c000000000ea713c] __do_softirq+0x15c/0x3e0
      	[c000000000166258] __irq_exit_rcu+0x158/0x190
      	[c000000000166420] irq_exit+0x20/0x40
      	[c00000000002853c] timer_interrupt+0x14c/0x2b0
      	[c000000000009a00] decrementer_common_virt+0x210/0x220
      	--- interrupt: 900 at plpar_hcall_norets_notrace+0x18/0x2c
      
      The immediate cause of the crash is the access of tx_scrq in the following
      snippet during a reset, where the tx_scrq can be either NULL or an address
      that will soon be invalid:
      
      	ibmvnic_xmit()
      	{
      		...
      		tx_scrq = adapter->tx_scrq[queue_num];
      		txq = netdev_get_tx_queue(netdev, queue_num);
      		ind_bufp = &tx_scrq->ind_buf;
      
      		if (test_bit(0, &adapter->resetting)) {
      		...
      	}
      
      But beyond that, the call to ibmvnic_xmit() itself is not safe during a
      reset and the reset path attempts to avoid this by stopping the queue in
      ibmvnic_cleanup(). However just after the queue was stopped, an in-flight
      ibmvnic_complete_tx() could have restarted the queue even as the reset is
      progressing.
      
      Since the queue was restarted we could get a call to ibmvnic_xmit() which
      can then access the bad tx_scrq (or other fields).
      
      However, we cannot simply have ibmvnic_complete_tx() check the ->resetting
      bit and skip starting the queue. This can race at the "back-end" of a good
      reset which just restarted the queue but has not cleared the ->resetting
      bit yet. If we skip restarting the queue due to ->resetting being true,
      the queue would remain stopped indefinitely, potentially leading to
      transmit timeouts.
      
      IOW ->resetting is too broad for this purpose. Instead use a new flag
      that indicates whether or not the queues are active. Only the open/
      reset paths control when the queues are active. ibmvnic_complete_tx()
      and others wake up the queue only if the queue is marked active.
      
      So we will have:
      	A. reset/open thread in ibmvnic_cleanup() and __ibmvnic_open()
      
      		->resetting = true
      		->tx_queues_active = false
      		disable tx queues
      		...
      		->tx_queues_active = true
      		start tx queues
      
      	B. Tx interrupt in ibmvnic_complete_tx():
      
      		if (->tx_queues_active)
      			netif_wake_subqueue();
      
      To ensure that ->tx_queues_active and state of the queues are consistent,
      we need a lock which:
      
      	- must also be taken in the interrupt path (ibmvnic_complete_tx())
      	- shared across the multiple queues in the adapter (so they don't
      	  become serialized)
      
      Use rcu_read_lock() and have the reset thread synchronize_rcu() after
      updating the ->tx_queues_active state.
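
The gating logic above can be sketched as a single-threaded model. This is
only an illustration, not the driver's code: struct model_adapter and the
model_*() helpers are hypothetical names, and the RCU calls appear only as
comments because the model has no concurrency.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant adapter fields. */
struct model_adapter {
	bool resetting;         /* models the ->resetting bit */
	bool tx_queues_active;  /* models ->tx_queues_active */
	bool queue_stopped;     /* models the netdev tx queue state */
};

/* Reset/open thread, front end: mark the queues inactive, then stop them. */
void model_cleanup(struct model_adapter *a)
{
	assert(a != NULL);
	a->resetting = true;
	a->tx_queues_active = false;
	/* Per the commit message, synchronize_rcu() follows the flag update
	 * so that readers already inside rcu_read_lock() finish first. */
	a->queue_stopped = true;
}

/* Reset/open thread, back end: mark the queues active, then start them.
 * ->resetting is intentionally still set when this returns. */
void model_open(struct model_adapter *a)
{
	assert(a != NULL);
	a->tx_queues_active = true;
	/* synchronize_rcu() would follow here as well. */
	a->queue_stopped = false;
}

/* Tx interrupt path: wake the queue only if it is marked active.
 * Returns true if the queue was woken. */
bool model_complete_tx(struct model_adapter *a)
{
	bool woke = false;

	assert(a != NULL);
	/* rcu_read_lock() in the interrupt path */
	if (a->tx_queues_active) {
		a->queue_stopped = false; /* stands in for netif_wake_subqueue() */
		woke = true;
	}
	/* rcu_read_unlock() */
	return woke;
}
```

In this model a completion that arrives after model_cleanup() leaves the
queue stopped, while one that arrives after model_open() wakes it even
though ->resetting is still set, which is the case a ->resetting check
would get wrong.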
      
      While here, consolidate a few boolean fields in ibmvnic_adapter for
      better alignment.
      
      Based on discussions with Brian King and Dany Madden.
      
      Fixes: 7ed5b31f ("net/ibmvnic: prevent more than one thread from running in reset")
      Reported-by: Vaishnavi Bhat <vaish123@in.ibm.com>
      Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4219196d
    • David S. Miller's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · 4fa331b4
      David S. Miller authored
      
      
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter fixes for net
      
      The following patchset contains Netfilter fixes for net:
      
      1) Fix PPPoE and QinQ with flowtable inet family.
      
      2) Missing register validation in nf_tables.
      
      3) Initialize registers to avoid stack memleak to userspace.
      ====================
      
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4fa331b4
    • Linus Torvalds's avatar
      Merge tag 'net-5.17-final' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 551acdc3
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from netfilter, ipsec, and wireless.
      
        A few last minute revert / disable and fix patches came down from our
        sub-trees. We're not waiting for any fixes at this point.
      
        Current release - regressions:
      
         - Revert "netfilter: nat: force port remap to prevent shadowing
           well-known ports", restore working conntrack on asymmetric paths
      
         - Revert "ath10k: drop beacon and probe response which leak from
           other channel", restore working AP and mesh mode on QCA9984
      
         - eth: intel: fix hang during reboot/shutdown
      
        Current release - new code bugs:
      
         - netfilter: nf_tables: disable register tracking, it needs more work
           to cover all corner cases
      
        Previous releases - regressions:
      
         - ipv6: fix skb_over_panic in __ip6_append_data when (admin-only)
           extension headers get specified
      
         - esp6: fix ESP over TCP/UDP, interpret ipv6_skip_exthdr's return
           value more selectively
      
         - bnx2x: fix driver load failure when FW not present in initrd
      
        Previous releases - always broken:
      
         - vsock: stop destroying unrelated sockets in nested virtualization
      
         - packet: fix slab-out-of-bounds access in packet_recvmsg()
      
        Misc:
      
         - add Paolo Abeni to networking maintainers!"
      
      * tag 'net-5.17-final' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (26 commits)
        iavf: Fix hang during reboot/shutdown
        net: mscc: ocelot: fix backwards compatibility with single-chain tc-flower offload
        net: bcmgenet: skip invalid partial checksums
        bnx2x: fix built-in kernel driver load failure
        net: phy: mscc: Add MODULE_FIRMWARE macros
        net: dsa: Add missing of_node_put() in dsa_port_parse_of
        net: handle ARPHRD_PIMREG in dev_is_mac_header_xmit()
        Revert "ath10k: drop beacon and probe response which leak from other channel"
        hv_netvsc: Add check for kvmalloc_array
        iavf: Fix double free in iavf_reset_task
        ice: destroy flow director filter mutex after releasing VSIs
        ice: fix NULL pointer dereference in ice_update_vsi_tx_ring_stats()
        Add Paolo Abeni to networking maintainers
        atm: eni: Add check for dma_map_single
        net/packet: fix slab-out-of-bounds access in packet_recvmsg()
        net: mdio: mscc-miim: fix duplicate debugfs entry
        net: phy: marvell: Fix invalid comparison in the resume and suspend functions
        esp6: fix check on ipv6_skip_exthdr's return value
        net: dsa: microchip: add spi_device_id tables
        netfilter: nf_tables: disable register tracking
        ...
      551acdc3
    • Linus Torvalds's avatar
      Merge tag 'acpi-5.17-rc9' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · c81801eb
      Linus Torvalds authored
      Pull ACPI fix from Rafael Wysocki:
       "Revert recent commit that caused multiple systems to misbehave due to
        firmware issues"
      
      * tag 'acpi-5.17-rc9' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        Revert "ACPI: scan: Do not add device IDs from _CID if _HID is not valid"
      c81801eb
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · 2ab99e54
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "Four patches.
      
        Subsystems affected by this patch series: mm/swap, kconfig, ocfs2, and
        selftests"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        selftests: vm: fix clang build error multiple output files
        ocfs2: fix crash when initialize filecheck kobj fails
        configs/debug: restore DEBUG_INFO=y for overriding
        mm: swap: get rid of livelock in swapin readahead
      2ab99e54
    • Yosry Ahmed's avatar
      selftests: vm: fix clang build error multiple output files · 1c4debc4
      Yosry Ahmed authored
      When building the vm selftests using clang, some errors are seen due to
      having headers in the compilation command:
      
        clang -Wall -I ../../../../usr/include  -no-pie    gup_test.c ../../../../mm/gup_test.h -lrt -lpthread -o .../tools/testing/selftests/vm/gup_test
        clang: error: cannot specify -o when generating multiple output files
        make[1]: *** [../lib.mk:146: .../tools/testing/selftests/vm/gup_test] Error 1
      
      Rework to add the header files to LOCAL_HDRS before including ../lib.mk,
      since the dependency is evaluated in '$(OUTPUT)/%:%.c $(LOCAL_HDRS)' in
      file lib.mk.
      
      Link: https://lkml.kernel.org/r/20220304000645.1888133-1-yosryahmed@google.com
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c4debc4
    • Joseph Qi's avatar
      ocfs2: fix crash when initialize filecheck kobj fails · 7b0b1332
      Joseph Qi authored
      Once s_root is set, generic_shutdown_super() will be called if
      fill_super() fails.  That means we will call ocfs2_dismount_volume()
      twice in that case, which can lead to a kernel crash.
      
      Fix this issue by initializing filecheck kobj before setting s_root.
      
      Link: https://lkml.kernel.org/r/20220310081930.86305-1-joseph.qi@linux.alibaba.com
      Fixes: 5f483c4a ("ocfs2: add kobject for online file check")
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Gang He <ghe@suse.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7b0b1332
    • Qian Cai's avatar
      configs/debug: restore DEBUG_INFO=y for overriding · 8208257d
      Qian Cai authored
      Previously, I failed to realize that Kees' patch [1] had not been merged
      into the mainline yet, and dropped DEBUG_INFO=y too eagerly from the
      mainline.  As a result, "make debug.config" won't be able to flip
      DEBUG_INFO=n from the existing .config.  This should close the gap of a
      few weeks before Kees' patch is there, and work regardless of its
      merging status anyway.
      
      Link: https://lore.kernel.org/all/20220125075126.891825-1-keescook@chromium.org/ [1]
      Link: https://lkml.kernel.org/r/20220308153524.8618-1-quic_qiancai@quicinc.com
      Signed-off-by: Qian Cai <quic_qiancai@quicinc.com>
      Reported-by: Daniel Thompson <daniel.thompson@linaro.org>
      Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8208257d
    • Guo Ziliang's avatar
      mm: swap: get rid of livelock in swapin readahead · 029c4628
      Guo Ziliang authored
      In our testing, a livelocked task was found.  Sysrq output showed the
      same stack every time, as follows:
      
        __swap_duplicate+0x58/0x1a0
        swapcache_prepare+0x24/0x30
        __read_swap_cache_async+0xac/0x220
        read_swap_cache_async+0x58/0xa0
        swapin_readahead+0x24c/0x628
        do_swap_page+0x374/0x8a0
        __handle_mm_fault+0x598/0xd60
        handle_mm_fault+0x114/0x200
        do_page_fault+0x148/0x4d0
        do_translation_fault+0xb0/0xd4
        do_mem_abort+0x50/0xb0
      
      The reason for the livelock is that swapcache_prepare() always returns
      -EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so the
      task cannot break out of the loop.  We suspected that the task that
      clears the SWAP_HAS_CACHE flag never gets a chance to run.  We tried
      lowering the priority of the livelocked task so that the task that
      clears the SWAP_HAS_CACHE flag could run, and the system returned to
      normal once the priority was lowered.
      
      In our testing, multiple real-time tasks are bound to the same core, and
      the task in the livelock is the highest priority task of the core, so
      the livelocked task cannot be preempted.
      
      Although __read_swap_cache_async() calls cond_resched(), that is an
      empty function on a preemptible kernel and does not release the CPU.
      A high-priority task cannot release the CPU unless it is preempted by
      a higher-priority task, and when the task is already the highest
      priority task on its core, no other task can be scheduled.  So we
      think cond_resched() should be replaced with
      schedule_timeout_uninterruptible(1).  schedule_timeout_uninterruptible()
      calls set_current_state() first to set the task state, so the task is
      removed from the run queue.  It thereby gives up the CPU and avoids
      running in kernel mode for too long.
      
      (akpm: ugly hack becomes uglier.  But it fixes the issue in a
      backportable-to-stable fashion while we hopefully work on something
      better)
      
      Link: https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@gmail.com
      Signed-off-by: Guo Ziliang <guo.ziliang@zte.com.cn>
      Reported-by: Zeal Robot <zealci@zte.com.cn>
      Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
      Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
      Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roger Quadros <rogerq@kernel.org>
      Cc: Ziliang Guo <guo.ziliang@zte.com.cn>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      029c4628
    • Ivan Vecera's avatar
      iavf: Fix hang during reboot/shutdown · b04683ff
      Ivan Vecera authored
      Recent commit 97457801 ("iavf: Add waiting so the port is
      initialized in remove") adds a wait-loop at the beginning of
      iavf_remove() to ensure that port initialization is finished
      prior to unregistering the net device. This causes a regression
      in the reboot/shutdown scenario, because in this case the callback
      iavf_shutdown() is called, and this callback detaches the device,
      brings it down if it is running and sets its state to __IAVF_REMOVE.
      Later the shutdown callback of the associated PF driver (e.g. ice_shutdown)
      is called. That callback calls, among other things, sriov_disable(),
      which indirectly calls iavf_remove() (see stack trace below).
      As the adapter state is already __IAVF_REMOVE, the mentioned
      loop is endless and the shutdown process hangs.
      
      The patch fixes this by checking the adapter's state at the beginning
      of iavf_remove() and skipping the rest of the function if the adapter
      is already in the remove state (shutdown is in progress).
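
A rough userspace sketch of the early check, under stated assumptions:
the enum, struct and model_*() functions below are hypothetical, the
"settled" states the loop waits for are an assumption, and the wait is
bounded by max_polls so that the model returns instead of hanging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of adapter states, for illustration only. */
enum model_state { MODEL_INIT, MODEL_DOWN, MODEL_RUNNING, MODEL_REMOVE };

struct model_vf {
	enum model_state state;
};

/* Models the shutdown callback: it leaves the adapter in the remove state. */
void model_shutdown(struct model_vf *vf)
{
	assert(vf != NULL);
	vf->state = MODEL_REMOVE;
}

/* Models the remove callback.
 * Returns  0 after a normal teardown,
 *          1 if the body was skipped because shutdown already ran,
 *         -1 if the bounded wait gave up (the real loop would spin forever). */
int model_remove(struct model_vf *vf, bool early_check, int max_polls)
{
	int polls;

	assert(vf != NULL);
	/* The fix: nothing left to do once shutdown has set the remove state. */
	if (early_check && vf->state == MODEL_REMOVE)
		return 1;

	/* Stand-in for the wait loop: poll until initialization has settled.
	 * Nothing moves the state out of MODEL_REMOVE, so this never exits
	 * on its own in that case. */
	for (polls = 0; vf->state != MODEL_DOWN && vf->state != MODEL_RUNNING; polls++) {
		if (polls >= max_polls)
			return -1;
	}

	vf->state = MODEL_REMOVE;
	return 0;
}
```

Calling model_shutdown() and then model_remove() with early_check set to
false exhausts the poll budget, which corresponds to the hang. With
early_check set to true the call returns immediately.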
      
      Reproducer:
      1. Create VF on PF driven by ice or i40e driver
      2. Ensure that the VF is bound to iavf driver
      3. Reboot
      
      [52625.981294] sysrq: SysRq : Show Blocked State
      [52625.988377] task:reboot          state:D stack:    0 pid:17359 ppid:     1 f2
      [52625.996732] Call Trace:
      [52625.999187]  __schedule+0x2d1/0x830
      [52626.007400]  schedule+0x35/0xa0
      [52626.010545]  schedule_hrtimeout_range_clock+0x83/0x100
      [52626.020046]  usleep_range+0x5b/0x80
      [52626.023540]  iavf_remove+0x63/0x5b0 [iavf]
      [52626.027645]  pci_device_remove+0x3b/0xc0
      [52626.031572]  device_release_driver_internal+0x103/0x1f0
      [52626.036805]  pci_stop_bus_device+0x72/0xa0
      [52626.040904]  pci_stop_and_remove_bus_device+0xe/0x20
      [52626.045870]  pci_iov_remove_virtfn+0xba/0x120
      [52626.050232]  sriov_disable+0x2f/0xe0
      [52626.053813]  ice_free_vfs+0x7c/0x340 [ice]
      [52626.057946]  ice_remove+0x220/0x240 [ice]
      [52626.061967]  ice_shutdown+0x16/0x50 [ice]
      [52626.065987]  pci_device_shutdown+0x34/0x60
      [52626.070086]  device_shutdown+0x165/0x1c5
      [52626.074011]  kernel_restart+0xe/0x30
      [52626.077593]  __do_sys_reboot+0x1d2/0x210
      [52626.093815]  do_syscall_64+0x5b/0x1a0
      [52626.097483]  entry_SYSCALL_64_after_hwframe+0x65/0xca
      
      Fixes: 97457801 ("iavf: Add waiting so the port is initialized in remove")
      Signed-off-by: Ivan Vecera <ivecera@redhat.com>
      Link: https://lore.kernel.org/r/20220317104524.2802848-1-ivecera@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      b04683ff
    • Vladimir Oltean's avatar
      net: mscc: ocelot: fix backwards compatibility with single-chain tc-flower offload · 8e0341ae
      Vladimir Oltean authored
      ACL rules can be offloaded to VCAP IS2 either through chain 0, or, since
      the blamed commit, through a chain index whose number encodes a specific
      PAG (Policy Action Group) and lookup number.
      
      The chain number is translated through ocelot_chain_to_pag() into a PAG,
      and through ocelot_chain_to_lookup() into a lookup number.
      
      The problem with the blamed commit is that the above 2 functions don't
      have special treatment for chain 0. So ocelot_chain_to_pag(0) returns
      filter->pag = 224, which is in fact -32, but the "pag" field is a u8.
      
      So we end up programming the hardware with VCAP IS2 entries having a PAG
      of 224. But the way in which the PAG works is that it defines a subset
      of VCAP IS2 filters which should match on a packet. The default PAG is
      0, and previous VCAP IS1 rules (which we offload using 'goto') can
      modify it. So basically, we are installing filters with a PAG on which
      no packet will ever match. This is the hardware equivalent of adding
      filters to a chain which has no 'goto' to it.
      
      Restore the previous functionality by making ACL filters offloaded to
      chain 0 go to PAG 0 and lookup number 0. The choice of PAG is clearly
      correct, but the choice of lookup number isn't "as before" (which was to
      leave the lookup a "don't care"). However, lookup 0 should be fine,
      since even though there are ACL actions (policers) which have a
      requirement to be used in a specific lookup, that lookup is 0.
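
The arithmetic behind the value 224 can be reproduced with a small
standalone sketch. The chain layout constants below are assumptions
modeled on the driver's chain numbering scheme, and the function names are
illustrative rather than the driver's own.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed chain numbering: block 2 is VCAP IS2, each lookup spans 1000
 * chain indices, and the PAG is the offset within the lookup. */
#define VCAP_BLOCK  10000
#define VCAP_LOOKUP 1000
#define VCAP_IS2_CHAIN(lookup, pag) \
	(2 * VCAP_BLOCK + (lookup) * VCAP_LOOKUP + (pag))

/* Lookup number encoded in a chain index. */
int chain_to_lookup(int chain)
{
	assert(chain >= 0);
	return (chain / VCAP_LOOKUP) % 10;
}

/* Behaviour without special treatment of chain 0: the PAG is the chain
 * index relative to the first PAG of its lookup, so chain 0 gives -20000. */
int chain_to_pag_unfixed(int chain)
{
	return chain - VCAP_IS2_CHAIN(chain_to_lookup(chain), 0);
}

/* Behaviour with the special case: chain 0 maps to PAG 0. */
int chain_to_pag_fixed(int chain)
{
	if (chain == 0)
		return 0;
	return chain - VCAP_IS2_CHAIN(chain_to_lookup(chain), 0);
}

/* The "pag" field is a u8, so the value is truncated modulo 256. */
uint8_t store_pag(int pag)
{
	return (uint8_t)pag;
}
```

Under these assumptions chain 0 yields -20000, and since 20000 = 78 * 256
+ 32, truncating to a u8 gives 256 - 32 = 224, matching the value in the
description. A chain such as VCAP_IS2_CHAIN(1, 5) still maps to lookup 1
and PAG 5 in both variants.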
      
      Fixes: 226e9cd8 ("net: mscc: ocelot: only install TCAM entries into a specific lookup and PAG")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20220316192117.2568261-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8e0341ae
    • Doug Berger's avatar
      net: bcmgenet: skip invalid partial checksums · 0f643c88
      Doug Berger authored
      The RXCHK block will return a partial checksum of 0 if it encounters
      a problem while receiving a packet. Since a 1's complement sum can
      only produce this result if no bits are set in the received data
      stream, it is fair to treat it as an invalid partial checksum and
      not pass it up the stack.
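
The property relied on here can be checked with a minimal sketch of a
16-bit 1's complement sum with end-around carry. The function names are
illustrative and are not taken from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 16-bit 1's complement sum of n words, folding the carry back in after
 * every addition. The result is not inverted. */
uint16_t ones_complement_sum(const uint16_t *words, size_t n)
{
	uint32_t sum = 0;
	size_t i;

	assert(words != NULL || n == 0);
	for (i = 0; i < n; i++) {
		sum += words[i];
		/* End-around carry: a carry out of bit 15 is added back as 1,
		 * so a nonzero running sum can never fold down to 0. */
		sum = (sum & 0xFFFF) + (sum >> 16);
	}
	return (uint16_t)sum;
}

/* Hypothetical helper mirroring the rule in the description: a partial
 * checksum of 0 is treated as invalid. */
int partial_csum_is_usable(uint16_t partial)
{
	return partial != 0;
}
```

For example, summing 0xFFFF and 0x0001 overflows to 0x10000 and folds to
0x0001 rather than 0, while an all-zero input is the only one that sums
to 0.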
      
      Fixes: 81015539 ("net: bcmgenet: use CHECKSUM_COMPLETE for NETIF_F_RXCSUM")
      Signed-off-by: Doug Berger <opendmb@gmail.com>
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/20220317012812.1313196-1-opendmb@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0f643c88
    • Manish Chopra's avatar
      bnx2x: fix built-in kernel driver load failure · 424e7834
      Manish Chopra authored
      Commit b7a49f73 ("bnx2x: Utilize firmware 7.13.21.0")
      added request_firmware() logic in probe() which caused
      load failure when firmware file is not present in initrd (below),
      as access to firmware file is not feasible during probe.
      
        Direct firmware load for bnx2x/bnx2x-e2-7.13.15.0.fw failed with error -2
        Direct firmware load for bnx2x/bnx2x-e2-7.13.21.0.fw failed with error -2
      
      This patch fixes this issue by:

      1. Removing the request_firmware() logic from probe(), so that
         .ndo_open() handles it, as it did before.

      2. Relaxing the FW version comparison against the FW version already
         loaded (by some other PF of the same adapter). Since
         request_firmware() is removed from probe(), the PF in the probe
         flow no longer knows which firmware file version it is going to
         initialize the device with in ndo_open(), so the driver has to
         allow different compatible/close enough FWs with which multiple
         PFs may run (in different environments).
      
      Link: https://lore.kernel.org/all/46f2d9d9-ae7f-b332-ddeb-b59802be2bab@molgen.mpg.de/
      Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Tested-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Fixes: b7a49f73 ("bnx2x: Utilize firmware 7.13.21.0")
      Signed-off-by: Manish Chopra <manishc@marvell.com>
      Signed-off-by: Ariel Elior <aelior@marvell.com>
      Link: https://lore.kernel.org/r/20220316214613.6884-1-manishc@marvell.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      424e7834
    • Juerg Haefliger's avatar
      net: phy: mscc: Add MODULE_FIRMWARE macros · f1858c27
      Juerg Haefliger authored
      The driver requires firmware so define MODULE_FIRMWARE so that modinfo
      provides the details.
      
      Fixes: fa164e40 ("net: phy: mscc: split the driver into separate files")
      Signed-off-by: Juerg Haefliger <juergh@canonical.com>
      Link: https://lore.kernel.org/r/20220316151835.88765-1-juergh@canonical.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f1858c27
  5. Mar 17, 2022
  6. Mar 16, 2022