  1. Dec 01, 2021
    • Joerg Roedel's avatar
      iommu/amd: Clarify AMD IOMMUv2 initialization messages · fbc0514e
      Joerg Roedel authored
      commit 717e88aa upstream.
      
      The messages printed on the initialization of the AMD IOMMUv2 driver
      have caused some confusion in the past. Clarify the messages to lower
      the confusion in the future.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Link: https://lore.kernel.org/r/20211123105507.7654-3-joro@8bytes.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Steve French's avatar
      smb3: do not error on fsync when readonly · 5655b8bc
      Steve French authored
      [ Upstream commit 71e6864e ]
      
      Linux allows doing a flush/fsync on a file open for read-only,
      but the protocol does not allow that.  If the file passed in
      on the flush is read-only try to find a writeable handle for
      the same inode, if that is not possible skip sending the
      fsync call to the server to avoid breaking the apps.
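
A minimal Python sketch of the decision described above. It is a model, not the cifs code: `Handle`, `find_writable_handle` and `flush` are hypothetical names, and the `sent` list only records which handle a flush request would be issued on.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Handle:
    inode: int
    writable: bool


def find_writable_handle(handles: List[Handle], inode: int) -> Optional[Handle]:
    """Return any open writable handle for the inode, or None."""
    for h in handles:
        if h.inode == inode and h.writable:
            return h
    return None


def flush(handle: Handle, open_handles: List[Handle], sent: List[Handle]) -> int:
    """Model of fsync: 'sent' collects the handles a flush was issued on."""
    if not handle.writable:
        # Read-only handle: the protocol would reject the flush, so look
        # for a writable handle on the same inode instead.
        handle = find_writable_handle(open_handles, handle.inode)
        if handle is None:
            return 0  # nothing writable: skip the request, report success
    sent.append(handle)
    return 0
```

In both branches the caller sees success; the only difference is whether a request goes out at all.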
      
      Reported-by: Julian Sikorski <belegdol@gmail.com>
      Tested-by: Julian Sikorski <belegdol@gmail.com>
      Suggested-by: Jeremy Allison <jra@samba.org>
      Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Jeff Layton's avatar
      ceph: properly handle statfs on multifs setups · c380062d
      Jeff Layton authored
      [ Upstream commit 8cfc0c7e ]
      
      ceph_statfs currently stuffs the cluster fsid into the f_fsid field.
      This was fine when we only had a single filesystem per cluster, but now
      that we have multiples we need to use something that will vary between
      them.
      
      Change ceph_statfs to xor each 32-bit chunk of the fsid (aka cluster id)
      into the lower bits of the statfs->f_fsid. Change the upper bits to hold
      the fscid (filesystem ID within the cluster).
      
      That should give us a value that is guaranteed to be unique between
      filesystems within a cluster, and should minimize the chance of
      collisions between mounts of different clusters.
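
The folding can be illustrated with a short Python sketch. It assumes a 16-byte cluster fsid read as little-endian 32-bit chunks, and it assumes the folded cluster id and the fscid end up in separate 32-bit words of f_fsid; `ceph_f_fsid` is a hypothetical helper name, not a kernel function.

```python
import struct


def ceph_f_fsid(cluster_fsid: bytes, fscid: int) -> tuple:
    """Fold a 16-byte cluster fsid and a filesystem id into two 32-bit words."""
    if len(cluster_fsid) != 16:
        raise ValueError("cluster fsid must be 16 bytes")
    folded = 0
    # XOR the four little-endian 32-bit chunks of the cluster id together.
    for (chunk,) in struct.iter_unpack("<I", cluster_fsid):
        folded ^= chunk
    # The second word carries the filesystem id within the cluster
    # (assumed placement, see the lead-in).
    return (folded, fscid & 0xFFFFFFFF)
```

Two filesystems in the same cluster share the folded word but differ in the fscid word, so their identifiers cannot collide.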
      
      URL: https://tracker.ceph.com/issues/52812
      
      
      Reported-by: Sachin Prabhu <sprabhu@redhat.com>
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: Xiubo Li <xiubli@redhat.com>
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Weichao Guo's avatar
      f2fs: set SBI_NEED_FSCK flag when inconsistent node block found · 22423c96
      Weichao Guo authored
      [ Upstream commit 6663b138 ]
      
      An inconsistent node block will cause a file to fail to open or read,
      which could make the user process crash or hang. Let's set the
      SBI_NEED_FSCK flag to trigger a fix at the next fsck. After
      unlinking the corrupted file, the user process can regenerate
      a new one and work correctly.
      
      Signed-off-by: Weichao Guo <guoweichao@oppo.com>
      Reviewed-by: Chao Yu <chao@kernel.org>
      Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Mark Rutland's avatar
      sched/scs: Reset task stack state in bringup_cpu() · e6ee7abd
      Mark Rutland authored
      [ Upstream commit dce1ca05 ]
      
      To hot unplug a CPU, the idle task on that CPU calls a few layers of C
      code before finally leaving the kernel. When KASAN is in use, poisoned
      shadow is left around for each of the active stack frames. When shadow
      call stacks (SCS) are in use, the task's saved SCS SP is left pointing
      at an arbitrary point within the task's shadow call stack.

      When a CPU is offlined then onlined back into the kernel, this stale
      state can adversely affect execution. Stale KASAN shadow can alias new
      stackframes and result in bogus KASAN warnings. A stale SCS SP is
      effectively a memory leak, and prevents a portion of the shadow call
      stack being used. Across a number of hotplug cycles the idle task's
      entire shadow call stack can become unusable.
      
      We previously fixed the KASAN issue in commit:
      
        e1b77c92 ("sched/kasan: remove stale KASAN poison after hotplug")
      
      ... by removing any stale KASAN stack poison immediately prior to
      onlining a CPU.
      
      Subsequently in commit:
      
        f1a0a376 ("sched/core: Initialize the idle task with preemption disabled")
      
      ... the refactoring left the KASAN and SCS cleanup in one-time idle
      thread initialization code rather than something invoked prior to each
      CPU being onlined, breaking both as above.
      
      We fixed SCS (but not KASAN) in commit:
      
        63acd42c ("sched/scs: Reset the shadow stack when idle_task_exit")
      
      ... but as this runs in the context of the idle task being offlined it's
      potentially fragile.
      
      To fix these consistently and more robustly, reset the SCS SP and KASAN
      shadow of a CPU's idle task immediately before we online that CPU in
      bringup_cpu(). This ensures the idle task always has a consistent state
      when it is running, and removes the need to do so when exiting an idle
      task.
      
      Whenever any thread is created, dup_task_struct() will give the task a
      stack which is free of KASAN shadow, and initialize the task's SCS SP,
      so there's no need to specially initialize either for idle thread within
      init_idle(), as this was only necessary to handle hotplug cycles.
      
      I've tested this on arm64 with:
      
      * gcc 11.1.0, defconfig +KASAN_INLINE, KASAN_STACK
      * clang 12.0.0, defconfig +KASAN_INLINE, KASAN_STACK, SHADOW_CALL_STACK
      
      ... offlining and onlining CPUs with:
      
      | while true; do
      |   for C in /sys/devices/system/cpu/cpu*/online; do
      |     echo 0 > $C;
      |     echo 1 > $C;
      |   done
      | done
      
      Fixes: f1a0a376 ("sched/core: Initialize the idle task with preemption disabled")
      Reported-by: Qian Cai <quic_qiancai@quicinc.com>
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Tested-by: Qian Cai <quic_qiancai@quicinc.com>
      Link: https://lore.kernel.org/lkml/20211115113310.35693-1-mark.rutland@arm.com/
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Arjun Roy's avatar
      tcp: correctly handle increased zerocopy args struct size · 71e38a0c
      Arjun Roy authored
      [ Upstream commit e0fecb28 ]
      
      A prior patch increased the size of struct tcp_zerocopy_receive
      but did not update do_tcp_getsockopt() handling to properly account
      for this.
      
      This patch simply reintroduces content erroneously cut from the
      referenced prior patch that handles the new struct size.
      
      Fixes: 18fb76ed ("net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.")
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Vladimir Oltean's avatar
      net: mscc: ocelot: correctly report the timestamping RX filters in ethtool · 72f2117e
      Vladimir Oltean authored
      [ Upstream commit c49a35ee ]
      
      The driver doesn't support RX timestamping for non-PTP packets, but it
      declares that it does. Restrict the reported RX filters to PTP v2 over
      L2 and over L4.
      
      Fixes: 4e3b0468 ("net: mscc: PTP Hardware Clock (PHC) support")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Vladimir Oltean's avatar
      net: mscc: ocelot: don't downgrade timestamping RX filters in SIOCSHWTSTAMP · 73115a2b
      Vladimir Oltean authored
      [ Upstream commit 8a075464 ]
      
      The ocelot driver, when asked to timestamp all receiving packets, 1588
      v1 or NTP, says "nah, here's 1588 v2 for you".
      
      According to this discussion:
      https://patchwork.kernel.org/project/netdevbpf/patch/20211104133204.19757-8-martin.kaistra@linutronix.de/#24577647
      drivers that downgrade from a wider request to a narrower response (or
      even a response where the intersection with the request is empty) are
      buggy, and should return -ERANGE instead. This patch fixes that.
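
A small Python sketch of the "reject instead of downgrade" rule. The `HWTSTAMP_FILTER_*` strings mirror the kernel's constant names, but the supported set and the `set_rx_filter` helper are hypothetical and simplified for illustration; this is not the driver's actual ioctl handler.

```python
import errno

# Filters this hypothetical driver can honour (PTP v2 only).
SUPPORTED_RX_FILTERS = {
    "HWTSTAMP_FILTER_NONE",
    "HWTSTAMP_FILTER_PTP_V2_EVENT",
    "HWTSTAMP_FILTER_PTP_V2_L2_EVENT",
    "HWTSTAMP_FILTER_PTP_V2_L4_EVENT",
}


def set_rx_filter(requested: str):
    """Return (status, filter): accept a supported filter, else -ERANGE."""
    if requested not in SUPPORTED_RX_FILTERS:
        # Do not silently narrow the request to something else.
        return (-errno.ERANGE, None)
    return (0, requested)
```

A request such as `HWTSTAMP_FILTER_ALL` now fails visibly, so the caller learns that the hardware cannot do what was asked.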
      
      Fixes: 4e3b0468 ("net: mscc: PTP Hardware Clock (PHC) support")
      Suggested-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Guangbin Huang's avatar
      net: hns3: fix VF RSS failed problem after PF enable multi-TCs · 62343dad
      Guangbin Huang authored
      [ Upstream commit 8d2ad993 ]
      
      When the PF is set to multi-TC mode and a mapping between priorities and
      TCs is configured, the hardware activates these settings for the PF
      and its VFs.

      In this case, when a VF uses only one TC and its rx packets carry a
      priority that is not mapped to TC0, the hardware always puts these
      packets into queue 0, because the other TCs of the VF are not valid.
      As a result, RSS cannot be used for these packets on the VF.

      To fix this problem, set the tc mode of all unused TCs of the VF to the
      setting of TC0, so that rx packets whose priority maps to an unused TC
      are directed to TC0.
      
      Fixes: e2cb1dec ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support")
      Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Tony Lu's avatar
      net/smc: Don't call clcsock shutdown twice when smc shutdown · 215167df
      Tony Lu authored
      [ Upstream commit bacb6c1e ]
      
      When applications call shutdown() with SHUT_RDWR in userspace,
      smc_close_active() calls kernel_sock_shutdown(), and it is called
      twice in smc_shutdown().
      
      Fix this by checking sk_state before doing the clcsock shutdown, which
      also avoids missing the application's call of smc_shutdown().
      
      Link: https://lore.kernel.org/linux-s390/1f67548e-cbf6-0dce-82b5-10288a4583bd@linux.ibm.com/
      Fixes: 606a63c9 ("net/smc: Ensure the active closing peer first closes clcsock")
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Link: https://lore.kernel.org/r/20211126024134.45693-1-tonylu@linux.alibaba.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Ziyang Xuan's avatar
      net: vlan: fix underflow for the real_dev refcnt · 6e800ee4
      Ziyang Xuan authored
      [ Upstream commit 01d9cc2d ]
      
      Inject error before dev_hold(real_dev) in register_vlan_dev(),
      and execute the following testcase:
      
      ip link add dev dummy1 type dummy
      ip link add name dummy1.100 link dummy1 type vlan id 100
      ip link del dev dummy1
      
      When the dummy netdevice is removed, we will get a WARNING as following:
      
      =======================================================================
      refcount_t: decrement hit 0; leaking memory.
      WARNING: CPU: 2 PID: 0 at lib/refcount.c:31 refcount_warn_saturate+0xbf/0x1e0
      
      and an endless loop of:
      
      =======================================================================
      unregister_netdevice: waiting for dummy1 to become free. Usage count = -1073741824
      
      That is because dev_put(real_dev) in vlan_dev_free() is called without
      a matching dev_hold(real_dev) in register_vlan_dev(). This makes the
      refcnt of real_dev underflow.
      
      Move the dev_hold(real_dev) to vlan_dev_init() which is the call-back of
      ndo_init(). That makes dev_hold() and dev_put() for vlan's real_dev
      symmetrical.
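
The asymmetry can be reproduced with a toy reference counter. This Python sketch is only a model: `Dev`, `create_vlan` and its two flags are hypothetical, and the "teardown" step stands in for the free path that always drops one reference.

```python
class Dev:
    """Toy net device with a reference count (not the kernel's struct)."""

    def __init__(self, name):
        self.name = name
        self.refcnt = 1  # baseline reference held by the device's owner


def dev_hold(dev):
    dev.refcnt += 1


def dev_put(dev):
    dev.refcnt -= 1


def create_vlan(real_dev, hold_in_init, fail_before_late_hold):
    """Model one vlan creation attempt; return real_dev's refcnt afterwards."""
    if hold_in_init:
        dev_hold(real_dev)      # new placement: taken in the init callback
    if fail_before_late_hold:
        dev_put(real_dev)       # teardown path always drops a reference
        return real_dev.refcnt
    if not hold_in_init:
        dev_hold(real_dev)      # old placement: late in registration
    return real_dev.refcnt
```

With the hold taken late, an injected failure leaves the count one below its baseline; with the hold taken in the init callback, hold and put stay paired.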
      
      Fixes: 563bcbae ("net: vlan: fix a UAF in vlan_dev_real_dev()")
      Reported-by: Petr Machata <petrm@nvidia.com>
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
      Link: https://lore.kernel.org/r/20211126015942.2918542-1-william.xuanziyang@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Davide Caratti's avatar
      net/sched: sch_ets: don't peek at classes beyond 'nbands' · ae2659d2
      Davide Caratti authored
      [ Upstream commit de6d2592 ]
      
      When the number of DRR classes decreases, the round-robin active list can
      contain elements that have already been freed in ets_qdisc_change(). As a
      consequence, it's possible to see a NULL dereference crash, caused by the
      attempt to call cl->qdisc->ops->peek(cl->qdisc) when cl->qdisc is NULL:
      
       BUG: kernel NULL pointer dereference, address: 0000000000000018
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP NOPTI
       CPU: 1 PID: 910 Comm: mausezahn Not tainted 5.16.0-rc1+ #475
       Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
       RIP: 0010:ets_qdisc_dequeue+0x129/0x2c0 [sch_ets]
       Code: c5 01 41 39 ad e4 02 00 00 0f 87 18 ff ff ff 49 8b 85 c0 02 00 00 49 39 c4 0f 84 ba 00 00 00 49 8b ad c0 02 00 00 48 8b 7d 10 <48> 8b 47 18 48 8b 40 38 0f ae e8 ff d0 48 89 c3 48 85 c0 0f 84 9d
       RSP: 0000:ffffbb36c0b5fdd8 EFLAGS: 00010287
       RAX: ffff956678efed30 RBX: 0000000000000000 RCX: 0000000000000000
       RDX: 0000000000000002 RSI: ffffffff9b938dc9 RDI: 0000000000000000
       RBP: ffff956678efed30 R08: e2f3207fe360129c R09: 0000000000000000
       R10: 0000000000000001 R11: 0000000000000001 R12: ffff956678efeac0
       R13: ffff956678efe800 R14: ffff956611545000 R15: ffff95667ac8f100
       FS:  00007f2aa9120740(0000) GS:ffff95667b800000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000018 CR3: 000000011070c000 CR4: 0000000000350ee0
       Call Trace:
        <TASK>
        qdisc_peek_dequeued+0x29/0x70 [sch_ets]
        tbf_dequeue+0x22/0x260 [sch_tbf]
        __qdisc_run+0x7f/0x630
        net_tx_action+0x290/0x4c0
        __do_softirq+0xee/0x4f8
        irq_exit_rcu+0xf4/0x130
        sysvec_apic_timer_interrupt+0x52/0xc0
        asm_sysvec_apic_timer_interrupt+0x12/0x20
       RIP: 0033:0x7f2aa7fc9ad4
       Code: b9 ff ff 48 8b 54 24 18 48 83 c4 08 48 89 ee 48 89 df 5b 5d e9 ed fc ff ff 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa <53> 48 83 ec 10 48 8b 05 10 64 33 00 48 8b 00 48 85 c0 0f 85 84 00
       RSP: 002b:00007ffe5d33fab8 EFLAGS: 00000202
       RAX: 0000000000000002 RBX: 0000561f72c31460 RCX: 0000561f72c31720
       RDX: 0000000000000002 RSI: 0000561f72c31722 RDI: 0000561f72c31720
       RBP: 000000000000002a R08: 00007ffe5d33fa40 R09: 0000000000000014
       R10: 0000000000000000 R11: 0000000000000246 R12: 0000561f7187e380
       R13: 0000000000000000 R14: 0000000000000000 R15: 0000561f72c31460
        </TASK>
       Modules linked in: sch_ets sch_tbf dummy rfkill iTCO_wdt intel_rapl_msr iTCO_vendor_support intel_rapl_common joydev virtio_balloon lpc_ich i2c_i801 i2c_smbus pcspkr ip_tables xfs libcrc32c crct10dif_pclmul crc32_pclmul crc32c_intel ahci libahci ghash_clmulni_intel serio_raw libata virtio_blk virtio_console virtio_net net_failover failover sunrpc dm_mirror dm_region_hash dm_log dm_mod
       CR2: 0000000000000018
      
      Ensuring that 'alist' was never zeroed [1] was not sufficient; we need to
      remove from the active list those elements that are no longer SP or DRR.
      
      [1] https://lore.kernel.org/netdev/60d274838bf09777f0371253416e8af71360bc08.1633609148.git.dcaratti@redhat.com/
      
      
      
      v3: fix race between ets_qdisc_change() and ets_qdisc_dequeue() delisting
          DRR classes beyond 'nbands' in ets_qdisc_change() with the qdisc lock
          acquired, thanks to Cong Wang.
      
      v2: when a NULL qdisc is found in the DRR active list, try to dequeue skb
          from the next list item.
      
      Reported-by: Hangbin Liu <liuhangbin@gmail.com>
      Fixes: dcc68b4d ("net: sch_ets: Add a new Qdisc")
      Signed-off-by: Davide Caratti <dcaratti@redhat.com>
      Link: https://lore.kernel.org/r/7a5c496eed2d62241620bdbb83eb03fb9d571c99.1637762721.git.dcaratti@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Jakub Kicinski's avatar
      tls: fix replacing proto_ops · e3509feb
      Jakub Kicinski authored
      [ Upstream commit f3911f73 ]
      
      We replace proto_ops whenever TLS is configured for RX. But our
      replacement also overrides sendpage_locked, which will crash
      unless TX is also configured. Similarly we plug both of those
      in for TLS_HW (NIC crypto offload) even though TLS_HW has a completely
      different implementation for TX.
      
      Last but not least we always plug in something based on inet_stream_ops
      even though a few of the callbacks differ for IPv6 (getname, release,
      bind).
      
      Use a callback building method similar to what we do for struct proto.
      
      Fixes: c46234eb ("tls: RX path for ktls")
      Fixes: d4ffb02d ("net/tls: enable sk_msg redirect to tls socket egress")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Jakub Kicinski's avatar
      tls: splice_read: fix record type check · 22156242
      Jakub Kicinski authored
      [ Upstream commit 520493f6 ]
      
      We don't support splicing control records. TLS 1.3 changes moved
      the record type check into the decrypt if(). The skb may already
      be decrypted and still be an alert.
      
      Note that decrypt_skb_update() is idempotent and updates ctx->decrypted
      so the if() is pointless.
      
      Reorder the check for decryption errors with the content type check
      while touching them. This part is not really a bug, because if
      decryption failed in TLS 1.3 content type will be DATA, and for
      TLS 1.2 it will be correct. Nevertheless it's strange to touch output
      before checking if the function has failed.
      
      Fixes: fedf201e ("net: tls: Refactor control message handling on recv")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Huang Pei's avatar
      MIPS: use 3-level pgtable for 64KB page size on MIPS_VA_BITS_48 · 3b6c71c0
      Huang Pei authored
      [ Upstream commit 41ce097f ]
      
      Booting a Loongson 3A1000 with both CONFIG_PAGE_SIZE_64KB and
      CONFIG_MIPS_VA_BITS_48 hangs, because the kernel turns out to use a
      2-level pgtable instead of a 3-level one. With a 64KB page size, a
      2-level pgtable only covers 42 bits of VA; use a 3-level pgtable
      (55 bits) to cover all 48 bits of VA.
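
The 42-bit and 55-bit figures follow from simple arithmetic, sketched here in Python. The sketch assumes every table level occupies exactly one page and uses 8-byte entries; `va_bits` is a hypothetical helper, not kernel code.

```python
def va_bits(page_size: int, levels: int, entry_size: int = 8) -> int:
    """Virtual address bits covered by a page table with equal-sized levels."""
    # Bits needed to address a byte within a page (64KB -> 16).
    offset_bits = page_size.bit_length() - 1
    # Bits needed to index one table (64KB / 8 bytes = 8192 entries -> 13).
    index_bits = (page_size // entry_size).bit_length() - 1
    return offset_bits + levels * index_bits
```

For a 64KB page, `va_bits(65536, 2)` gives 16 + 2 * 13 = 42, which is short of 48, while `va_bits(65536, 3)` gives 16 + 3 * 13 = 55.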
      
      Fixes: 1e321fa9 ("MIPS64: Support of at least 48 bits of SEGBITS")
      Signed-off-by: Huang Pei <huangpei@loongson.cn>
      Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Huang Pei's avatar
      MIPS: loongson64: fix FTLB configuration · a6a5d853
      Huang Pei authored
      [ Upstream commit 7db5e9e9 ]
      
      It turns out that 'decode_configs' -> 'set_ftlb_enable' is called while
      c->cputype is still unset, which leaves FTLB disabled on both 3A2000 and
      3A3000.

      Fix it by calling 'decode_configs' after c->cputype is initialized.
      
      Fixes: da1bd297 ("MIPS: Loongson64: Probe CPU features via CPUCFG")
      Signed-off-by: Huang Pei <huangpei@loongson.cn>
      Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Jesse Brandeburg's avatar
      igb: fix netpoll exit with traffic · 5e823dbe
      Jesse Brandeburg authored
      [ Upstream commit eaeace60 ]
      
      Oleksandr brought a bug report where netpoll causes trace
      messages in the log on igb.
      
      Danielle brought this back up as still occurring, so we'll try
      again.
      
      [22038.710800] ------------[ cut here ]------------
      [22038.710801] igb_poll+0x0/0x1440 [igb] exceeded budget in poll
      [22038.710802] WARNING: CPU: 12 PID: 40362 at net/core/netpoll.c:155 netpoll_poll_dev+0x18a/0x1a0
      
      As Alex suggested, change the driver to return work_done at the
      exit of napi_poll, which should be safe to do in this driver
      because it is not polling multiple queues in this single napi
      context (multiple queues attached to one MSI-X vector). Several
      other drivers contain the same simple sequence, so I hope
      this will not create new problems.
      
      Fixes: 16eb8815 ("igb: Refactor clean_rx_irq to reduce overhead and improve performance")
      Reported-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Reported-by: Danielle Ratson <danieller@nvidia.com>
      Suggested-by: Alexander Duyck <alexander.duyck@gmail.com>
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Danielle Ratson <danieller@nvidia.com>
      Link: https://lore.kernel.org/r/20211123204000.1597971-1-jesse.brandeburg@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Maurizio Lombardi's avatar
      nvmet: use IOCB_NOWAIT only if the filesystem supports it · f2a58ff3
      Maurizio Lombardi authored
      [ Upstream commit c024b226 ]
      
      Submit I/O requests with the IOCB_NOWAIT flag set only if
      the underlying filesystem supports it.
      
      Fixes: 50a909db ("nvmet: use IOCB_NOWAIT for file-ns buffered I/O")
      Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Guo DaXing's avatar
      net/smc: Fix loop in smc_listen · 12ceb52f
      Guo DaXing authored
      [ Upstream commit 9ebb0c4b ]
      
      The kernel_listen function in smc_listen will fail when all the available
      ports are occupied.  At this point smc->clcsock->sk->sk_data_ready has
      been changed to smc_clcsock_data_ready.  When we call smc_listen again,
      now both smc->clcsock->sk->sk_data_ready and smc->clcsk_data_ready point
      to the smc_clcsock_data_ready function.
      
      The smc_clcsock_data_ready() function calls lsmc->clcsk_data_ready which
      now points to itself resulting in an infinite loop.
      
      This patch restores smc->clcsock->sk->sk_data_ready with the old value.
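
The self-referencing callback is easy to reproduce in a toy model. In this Python sketch `Sock`, `listen` and the `restore_on_error` switch are hypothetical; `kernel_listen_ok` stands in for the result of kernel_listen(), and the saved callback plays the role of clcsk_data_ready.

```python
class Sock:
    """Toy socket holding the current and the saved data-ready callback."""

    def __init__(self):
        self.sk_data_ready = self.default_data_ready
        self.saved_data_ready = None

    def default_data_ready(self):
        return "default"


def smc_data_ready(sock):
    # Wrapper callback: chain to whatever was saved as the original.
    return sock.saved_data_ready()


def listen(sock, kernel_listen_ok, restore_on_error=True):
    """Install the wrapper; on failure optionally put the old callback back."""
    sock.saved_data_ready = sock.sk_data_ready
    sock.sk_data_ready = lambda: smc_data_ready(sock)
    if not kernel_listen_ok:
        if restore_on_error:
            sock.sk_data_ready = sock.saved_data_ready
        return -1
    return 0
```

Without the restore, a second `listen()` saves the wrapper as the "original", so the wrapper ends up calling itself; in Python this surfaces as a `RecursionError` rather than a hang.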
      
      Fixes: a60a2b1e ("net/smc: reduce active tcp_listen workers")
      Signed-off-by: Guo DaXing <guodaxing@huawei.com>
      Acked-by: Tony Lu <tonylu@linux.alibaba.com>
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Karsten Graul's avatar
      net/smc: Fix NULL pointer dereferencing in smc_vlan_by_tcpsk() · c94cbd26
      Karsten Graul authored
      [ Upstream commit 587acad4 ]
      
      Coverity reports a possible NULL dereferencing problem:
      
      in smc_vlan_by_tcpsk():
      6. returned_null: netdev_lower_get_next returns NULL (checked 29 out of 30 times).
      7. var_assigned: Assigning: ndev = NULL return value from netdev_lower_get_next.
      1623                ndev = (struct net_device *)netdev_lower_get_next(ndev, &lower);
      CID 1468509 (#1 of 1): Dereference null return value (NULL_RETURNS)
      8. dereference: Dereferencing a pointer that might be NULL ndev when calling is_vlan_dev.
      1624                if (is_vlan_dev(ndev)) {
      
      Remove the manual implementation and use netdev_walk_all_lower_dev() to
      iterate over the lower devices. While at it, remove an obsolete function
      parameter comment.
      
      Fixes: cb9d43f6 ("net/smc: determine vlan_id of stacked net_device")
      Suggested-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Russell King (Oracle)'s avatar
      net: phylink: Force retrigger in case of latched link-fail indicator · 3d4937c6
      Russell King (Oracle) authored
      [ Upstream commit dbae3388 ]
      
      On mv88e6xxx 1G/2.5G PCS, the SerDes register 4.2001.2 has the following
      description:
        This register bit indicates when link was lost since the last
        read. For the current link status, read this register
        back-to-back.
      
      Thus to get current link state, we need to read the register twice.
      
      But doing that in the link change interrupt handler would lead to
      potentially ignoring link down events, which we really want to avoid.
      
      Thus this needs to be solved in phylink's resolve, by retriggering
      another resolve in the event when PCS reports link down and previous
      link was up, and by re-reading PCS state if the previous link was down.
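
A Python sketch of a latched link-fail bit and of the resolve rule described above. It is a model under stated assumptions: `LatchedLinkBit` and `resolve` are hypothetical names, and the returned `retrigger` flag stands for scheduling another resolve run.

```python
class LatchedLinkBit:
    """Latching-low link bit: a read reports any drop since the last read."""

    def __init__(self, link_up):
        self.link_up = link_up
        self.latched_fail = not link_up

    def set_link(self, up):
        self.link_up = up
        if not up:
            self.latched_fail = True  # remember that the link dropped

    def read(self):
        value = self.link_up and not self.latched_fail
        self.latched_fail = not self.link_up  # reading clears the latch
        return value


def resolve(bit, prev_link_up):
    """Return (link_up, retrigger) following the rule in the text."""
    link = bit.read()
    if not link:
        if prev_link_up:
            return (False, True)  # report the drop, then resolve again
        link = bit.read()         # was already down: second read is current
    return (link, False)
```

A link that was down and has just come up reads as down once because of the latch; the second read inside `resolve()` returns the current state, so the link comes up.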
      
      The wrong value is read when phylink requests change from sgmii to
      2500base-x mode, and link won't come up. This fixes the bug.
      
      Fixes: 9525ae83 ("phylink: add phylink infrastructure")
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Russell King (Oracle)'s avatar
      net: phylink: Force link down and retrigger resolve on interface change · 50162ff3
      Russell King (Oracle) authored
      [ Upstream commit 80662f4f ]
      
      On PHY state change the phylink_resolve() function can read stale
      information from the MAC and report incorrect link speed and duplex to
      the kernel message log.
      
      Example with a Marvell 88X3310 PHY connected to a SerDes port on Marvell
      88E6393X switch:
      - PHY driver triggers state change due to PHY interface mode being
        changed from 10gbase-r to 2500base-x due to copper change in speed
        from 10Gbps to 2.5Gbps, but the PHY itself either hasn't yet changed
        its interface to the host, or the interrupt about loss of SerDes link
        hadn't arrived yet (there can be a delay of several milliseconds for
        this), so we still think that the 10gbase-r mode is up
      - phylink_resolve()
        - phylink_mac_pcs_get_state()
          - this fills in speed=10g link=up
        - interface mode is updated to 2500base-x but speed is left at 10Gbps
        - phylink_major_config()
          - interface is changed to 2500base-x
        - phylink_link_up()
          - mv88e6xxx_mac_link_up()
            - .port_set_speed_duplex()
              - speed is set to 10Gbps
          - reports "Link is Up - 10Gbps/Full" to dmesg
      
      Afterwards when the interrupt finally arrives for mv88e6xxx, another
      resolve is forced in which we get the correct speed from
      phylink_mac_pcs_get_state(), but since the interface is not being
      changed anymore, we don't call phylink_major_config() but only
      phylink_mac_config(), which does not set speed/duplex anymore.
      
      To fix this, we need to force the link down and trigger another resolve
      on PHY interface change event.
      
      Fixes: 9525ae83 ("phylink: add phylink infrastructure")
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Heiner Kallweit's avatar
      lan743x: fix deadlock in lan743x_phy_link_status_change() · 95ba8f0d
      Heiner Kallweit authored
      [ Upstream commit ddb826c2 ]
      
      Usage of phy_ethtool_get_link_ksettings() in the link status change
      handler isn't needed, and in combination with the referenced change
      it results in a deadlock. Simply remove the call and replace it with
      direct access to phydev->speed. The duplex argument of
      lan743x_phy_update_flowcontrol() isn't used and can be removed.
      
      Fixes: c10a485c ("phy: phy_ethtool_ksettings_get: Lock the phy for consistency")
      Reported-by: Alessandro B Maurici <abmaurici@gmail.com>
      Tested-by: Alessandro B Maurici <abmaurici@gmail.com>
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/40e27f76-0ba3-dcef-ee32-a78b9df38b0f@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Eric Dumazet's avatar
      tcp_cubic: fix spurious Hystart ACK train detections for not-cwnd-limited flows · c5e4316d
      Eric Dumazet authored
      [ Upstream commit 4e1fddc9 ]
      
      While testing BIG TCP patch series, I was expecting that TCP_RR workloads
      with 80KB requests/answers would send one 80KB TSO packet,
      then being received as a single GRO packet.
      
      It turns out this was not happening, and the root cause was that
      cubic Hystart ACK train was triggering after a few (2 or 3) rounds of RPC.
      
      Hystart was wrongly setting CWND/SSTHRESH to 30, while my RPC
      needed a budget of ~20 segments.
      
      Ideally these TCP_RR flows should not exit slow start.
      
      Cubic Hystart should reset itself at each round, instead of assuming
      every TCP flow is a bulk one.
      
      Note that even after this patch, Hystart can still trigger, depending
      on scheduling artifacts, but at a higher CWND/SSTHRESH threshold,
      keeping optimal TSO packet sizes.
      
      Tested:
      
      ip link set dev eth0 gro_ipv6_max_size 131072 gso_ipv6_max_size 131072
      nstat -n; netperf -H ... -t TCP_RR  -l 5  -- -r 80000,80000 -K cubic; nstat|egrep "Ip6InReceives|Hystart|Ip6OutRequests"
      
      Before:
      
         8605
      Ip6InReceives                   87541              0.0
      Ip6OutRequests                  129496             0.0
      TcpExtTCPHystartTrainDetect     1                  0.0
      TcpExtTCPHystartTrainCwnd       30                 0.0
      
      After:
      
        8760
      Ip6InReceives                   88514              0.0
      Ip6OutRequests                  87975              0.0
      
      Fixes: ae27e98a ("[TCP] CUBIC v2.3")
      Co-developed-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Link: https://lore.kernel.org/r/20211123202535.1843771-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c5e4316d
    • Nicholas Kazlauskas's avatar
      drm/amd/display: Set plane update flags for all planes in reset · 31876230
      Nicholas Kazlauskas authored
      [ Upstream commit 21431f70 ]
      
      [Why]
      We're only setting the flags on stream[0]'s planes so this logic fails
      if we have more than one stream in the state.
      
      This can cause a page flip timeout with multiple displays in the
      configuration.
      
      [How]
      Index into the stream_status array using the stream index - it's a 1:1
      mapping.
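
      The per-stream indexing described above can be sketched as follows.
      This is a simplified model, not the driver code: the struct layout,
      the array bounds and the name mark_all_planes() are assumptions made
      for illustration only.

```c
#include <assert.h>

#define MAX_STREAMS 4   /* hypothetical bound, for illustration */
#define MAX_PLANES  4   /* hypothetical bound, for illustration */

/* Simplified stand-in for the per-stream status entry. */
struct stream_status {
    int plane_count;
    int full_update[MAX_PLANES];   /* stand-in for the plane update flags */
};

/* Simplified stand-in for the display state: status[i] belongs to stream i. */
struct dc_state_sketch {
    int stream_count;
    struct stream_status stream_status[MAX_STREAMS];
};

/* Flag every plane of every stream, not only the planes of stream 0. */
static void mark_all_planes(struct dc_state_sketch *st)
{
    assert(st->stream_count <= MAX_STREAMS);
    for (int i = 0; i < st->stream_count; i++) {
        /* Index by the stream index: the mapping is 1:1. */
        struct stream_status *ss = &st->stream_status[i];

        for (int j = 0; j < ss->plane_count; j++)
            ss->full_update[j] = 1;
    }
}
```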
      
      Fixes: cdaae837 ("drm/amd/display: Handle GPU reset for DC block")

      Reviewed-by: Harry Wentland <Harry.Wentland@amd.com>
      Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
      Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
      Tested-by: Daniel Wheeler <daniel.wheeler@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      31876230
    • Thomas Zeitlhofer's avatar
      PM: hibernate: use correct mode for swsusp_close() · f634c755
      Thomas Zeitlhofer authored
      [ Upstream commit cefcf24b ]
      
      Commit 39fbef4b ("PM: hibernate: Get block device exclusively in
      swsusp_check()") changed the opening mode of the block device to
      (FMODE_READ | FMODE_EXCL).
      
      In the corresponding calls to swsusp_close(), the mode is still just
      FMODE_READ which triggers the warning in blkdev_flush_mapping() on
      resume from hibernate.
      
      So, use the mode (FMODE_READ | FMODE_EXCL) also when closing the
      device.
      
      Fixes: 39fbef4b ("PM: hibernate: Get block device exclusively in swsusp_check()")
      Signed-off-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f634c755
    • Kumar Thangavel's avatar
      net/ncsi : Add payload to be 32-bit aligned to fix dropped packets · 440bd9fa
      Kumar Thangavel authored
      [ Upstream commit ac132852 ]
      
      Update the NC-SI command handler (both standard and OEM) to take
      payload padding into account when allocating the skb (in case the
      payload size is not 32-bit aligned).
      
      The checksum field follows the payload field. Not taking payload
      padding into account can cause the checksum to be truncated, leading
      to dropped packets.
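
      The size arithmetic can be sketched as below. HDR_LEN and
      CHECKSUM_LEN are assumed values chosen for illustration, and
      align4() and cmd_buf_len() are hypothetical helpers, not functions
      from the NC-SI code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes for illustration; not taken from the NC-SI headers. */
#define HDR_LEN      16u   /* assumed packet header size */
#define CHECKSUM_LEN  4u   /* 32-bit checksum that follows the payload */

/* Round len up to the next multiple of 4 bytes (32-bit alignment). */
static size_t align4(size_t len)
{
    return (len + 3u) & ~(size_t)3u;
}

/* Bytes needed so that the checksum placed after the padded payload fits. */
static size_t cmd_buf_len(size_t payload_len)
{
    return HDR_LEN + align4(payload_len) + CHECKSUM_LEN;
}
```

      With a 5-byte payload, sizing the buffer from the unpadded length
      would leave room for only 1 of the 4 checksum bytes once the payload
      has been padded to 8 bytes.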
      
      Fixes: fb4ee675 ("net/ncsi: Add NCSI OEM command support")
      Signed-off-by: Kumar Thangavel <thangavel.k@hcl.com>
      Acked-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
      Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      440bd9fa
    • Varun Prakash's avatar
      nvmet-tcp: fix incomplete data digest send · ac88cb3c
      Varun Prakash authored
      [ Upstream commit 102110ef ]
      
      The current nvmet_try_send_ddgst() code does not check whether all
      data digest bytes have been transmitted. Fix this by returning
      -EAGAIN if not all data digest bytes were transmitted.
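
      A minimal sketch of the partial-send accounting, assuming a
      transport call that reports how many bytes it accepted. The struct
      digest_tx and digest_tx_advance() are hypothetical names used for
      illustration; this is not the nvmet-tcp code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical send state; field names are illustrative only. */
struct digest_tx {
    size_t offset;   /* digest bytes already sent */
    size_t total;    /* total digest length in bytes */
};

/*
 * Account for the result of one send attempt.
 * Returns 1 once the whole digest is out, -EAGAIN while bytes remain,
 * or the negative error that the send attempt reported.
 */
static int digest_tx_advance(struct digest_tx *tx, long sent)
{
    if (sent < 0)
        return (int)sent;

    /* The transport must not report more than was requested. */
    assert((size_t)sent <= tx->total - tx->offset);

    tx->offset += (size_t)sent;
    if (tx->offset < tx->total)
        return -EAGAIN;   /* short send: the caller has to retry later */

    tx->offset = 0;       /* done, reset for the next digest */
    return 1;
}
```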
      
      Fixes: 872d26a3 ("nvmet-tcp: add NVMe over TCP target driver")
      Signed-off-by: Varun Prakash <varun@chelsio.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ac88cb3c
    • Marek Behún's avatar
      net: marvell: mvpp2: increase MTU limit when XDP enabled · 8889ff80
      Marek Behún authored
      [ Upstream commit 7b1b62bc ]
      
      Currently mvpp2_xdp_setup won't allow attaching XDP program if
        mtu > ETH_DATA_LEN (1500).
      
      The mvpp2_change_mtu on the other hand checks whether
        MVPP2_RX_PKT_SIZE(mtu) > MVPP2_BM_LONG_PKT_SIZE.
      
      These two checks are semantically different.
      
      Moreover this limit can be increased to MVPP2_MAX_RX_BUF_SIZE, since in
      mvpp2_rx we have
        xdp.data = data + MVPP2_MH_SIZE + MVPP2_SKB_HEADROOM;
        xdp.frame_sz = PAGE_SIZE;
      
      Change the checks to check whether
        mtu > MVPP2_MAX_RX_BUF_SIZE
      
      Fixes: 07dd0a7a ("mvpp2: add basic XDP support")
      Signed-off-by: Marek Behún <kabel@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      8889ff80
    • Amit Cohen's avatar
      mlxsw: spectrum: Protect driver from buggy firmware · 90d07368
      Amit Cohen authored
      [ Upstream commit 63b08b1f ]
      
      When processing port up/down events generated by the device's firmware,
      the driver protects itself from events reported for non-existent local
      ports, but not the CPU port (local port 0), which exists, but lacks a
      netdev.
      
      This can result in a NULL pointer dereference when calling
      netif_carrier_{on,off}().
      
      Fix this by bailing early when processing an event reported for the CPU
      port. Problem was only observed when running on top of a buggy emulator.
      
      Fixes: 28b1987e ("mlxsw: spectrum: Register CPU port with devlink")
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      90d07368
    • Danielle Ratson's avatar
      mlxsw: Verify the accessed index doesn't exceed the array length · 33d89128
      Danielle Ratson authored
      [ Upstream commit 837ec05c ]
      
      There are a few cases in which an array index queried from a fw
      register is accessed without any validation that it doesn't exceed
      the array length.
      
      Add proper length validation, so that accessing memory past the end
      of an array is prevented.
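
      The kind of check being added can be sketched as follows. MAX_PORTS,
      struct port and port_lookup() are hypothetical and only illustrate
      validating an index that comes from an untrusted source before it is
      used; they are not mlxsw code.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PORTS 8u   /* hypothetical table size */

struct port {
    int id;
};

static struct port *ports[MAX_PORTS];

/* Look up a port by an index read from an untrusted source (e.g. firmware). */
static struct port *port_lookup(unsigned int idx)
{
    if (idx >= MAX_PORTS)
        return NULL;   /* reject instead of reading past the array end */
    return ports[idx];
}
```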
      
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      33d89128
    • Tony Lu's avatar
      net/smc: Ensure the active closing peer first closes clcsock · 29e1b573
      Tony Lu authored
      [ Upstream commit 606a63c9 ]
      
      The clcsock of the side that actively closes the socket doesn't enter
      the TIME_WAIT state, but the clcsock of the passive side does. It should
      show the same behavior as TCP sockets.
      
      Consider this: when the client actively closes the socket, the clcsock
      in the server enters the TIME_WAIT state, which means the address is
      occupied and won't be reused before TIME_WAIT expires. If we restarted
      the server, the service would be unavailable for a long time.
      
      To solve this issue, shut down the clcsock in [A] and perform the TCP
      active close process first, before the passively closed side closes it,
      so that the actively closing side enters TIME_WAIT, not the passive one.
      
      Client                                            |  Server
      close() // client actively close                  |
        smc_release()                                   |
            smc_close_active() // PEERCLOSEWAIT1        |
                smc_close_final() // abort or closed = 1|
                    smc_cdc_get_slot_and_msg_send()     |
                [A]                                     |
                                                        |smc_cdc_msg_recv_action() // ACTIVE
                                                        |  queue_work(smc_close_wq, &conn->close_work)
                                                        |    smc_close_passive_work() // PROCESSABORT or APPCLOSEWAIT1
                                                        |      smc_close_passive_abort_received() // only in abort
                                                        |
                                                        |close() // server recv zero, close
                                                        |  smc_release() // PROCESSABORT or APPCLOSEWAIT1
                                                        |    smc_close_active()
                                                        |      smc_close_abort() or smc_close_final() // CLOSED
                                                        |        smc_cdc_get_slot_and_msg_send() // abort or closed = 1
      smc_cdc_msg_recv_action()                         |    smc_clcsock_release()
        queue_work(smc_close_wq, &conn->close_work)     |      sock_release(tcp) // actively close clc, enter TIME_WAIT
          smc_close_passive_work() // PEERCLOSEWAIT1    |    smc_conn_free()
            smc_close_passive_abort_received() // CLOSED|
            smc_conn_free()                             |
            smc_clcsock_release()                       |
              sock_release(tcp) // passive close clc    |
      
      Link: https://www.spinics.net/lists/netdev/msg780407.html
      Fixes: b38d7324 ("smc: socket closing and linkgroup cleanup")
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Reviewed-by: Wen Gu <guwen@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      29e1b573
    • Huang Jianan's avatar
      erofs: fix deadlock when shrink erofs slab · 77d9c2ef
      Huang Jianan authored
      [ Upstream commit 57bbeacd ]
      
      We observed the following deadlock in the stress test under low
      memory scenario:
      
      Thread A                               Thread B
      - erofs_shrink_scan
       - erofs_try_to_release_workgroup
        - erofs_workgroup_try_to_freeze -- A
                                             - z_erofs_do_read_page
                                              - z_erofs_collection_begin
                                               - z_erofs_register_collection
                                                - erofs_insert_workgroup
                                                 - xa_lock(&sbi->managed_pslots) -- B
                                                 - erofs_workgroup_get
                                                  - erofs_wait_on_workgroup_freezed -- A
        - xa_erase
         - xa_lock(&sbi->managed_pslots) -- B
      
      To fix this, it needs to hold xa_lock before freezing the workgroup
      since xarray will be touched then. So let's hold the lock before
      accessing each workgroup, just like what we did with the radix tree
      before.
      
      [ Gao Xiang: Jianhua Hao also reports this issue at
        https://lore.kernel.org/r/b10b85df30694bac8aadfe43537c897a@xiaomi.com ]
      
      Link: https://lore.kernel.org/r/20211118135844.3559-1-huangjianan@oppo.com
      Fixes: 64094a04 ("erofs: convert workstn to XArray")
      Reviewed-by: Chao Yu <chao@kernel.org>
      Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
      Signed-off-by: Huang Jianan <huangjianan@oppo.com>
      Reported-by: Jianhua Hao <haojianhua1@xiaomi.com>
      Signed-off-by: Gao Xiang <xiang@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      77d9c2ef
    • Shin'ichiro Kawasaki's avatar
      scsi: scsi_debug: Zero clear zones at reset write pointer · 9f540c7f
      Shin'ichiro Kawasaki authored
      [ Upstream commit 2d62253e ]
      
      When a reset is requested the position of the write pointer is updated but
      the data in the corresponding zone is not cleared. Instead scsi_debug
      returns any data written before the write pointer was reset. This is an
      error and prevents using scsi_debug for stale page cache testing of the
      BLKRESETZONE ioctl.
      
      Zero written data in the zone when resetting the write pointer.
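
      A minimal sketch of the behavior, assuming a flat byte array as the
      backing store. struct zone and zone_reset_wp() are hypothetical
      names for illustration and do not mirror the scsi_debug data
      structures.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical zone descriptor; offsets are bytes into the backing store. */
struct zone {
    uint32_t start;   /* first byte of the zone */
    uint32_t wp;      /* write pointer, start <= wp */
};

/* Reset the write pointer and zero whatever had been written in the zone. */
static void zone_reset_wp(struct zone *z, uint8_t *store)
{
    assert(z->wp >= z->start);
    if (z->wp > z->start)
        memset(store + z->start, 0, z->wp - z->start);
    z->wp = z->start;
}
```

      Reads from the zone after the reset then see zeroes rather than the
      data written before it.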
      
      Link: https://lore.kernel.org/r/20211122061223.298890-1-shinichiro.kawasaki@wdc.com
      Fixes: f0d1cf93 ("scsi: scsi_debug: Add ZBC zone commands")
      Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Acked-by: Douglas Gilbert <dgilbert@interlog.com>
      Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9f540c7f
    • Mike Christie's avatar
      scsi: core: sysfs: Fix setting device state to SDEV_RUNNING · 725ba128
      Mike Christie authored
      [ Upstream commit eb97545d ]
      
      This fixes an issue added in commit 4edd8cd4 ("scsi: core: sysfs: Fix
      hang when device state is set via sysfs") where if userspace is requesting
      to set the device state to SDEV_RUNNING when the state is already
      SDEV_RUNNING, we return -EINVAL instead of count. The commit above set ret
      to count for this case, when it should have set it to 0.
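
      The return convention at issue can be sketched like this. The enum,
      set_state() and state_store() are hypothetical simplifications, not
      the SCSI sysfs code: the point is only that the intermediate result
      must be 0 on success because the final line maps 0 to count and
      anything else to -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum dev_state { DEV_RUNNING, DEV_OFFLINE, DEV_DELETED };

/* Hypothetical transition helper: 0 on success, -EINVAL if not allowed. */
static int set_state(enum dev_state *cur, enum dev_state req)
{
    if (*cur == DEV_DELETED)
        return -EINVAL;
    *cur = req;
    return 0;
}

/* Store handler: returns the number of bytes consumed or a negative errno. */
static long state_store(enum dev_state *cur, enum dev_state req, size_t count)
{
    int ret;

    if (*cur == DEV_RUNNING && req == DEV_RUNNING)
        ret = 0;   /* no-op success; `count` here would be read as failure */
    else
        ret = set_state(cur, req);

    return ret == 0 ? (long)count : -EINVAL;
}
```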
      
      Link: https://lore.kernel.org/r/20211120164917.4924-1-michael.christie@oracle.com
      Fixes: 4edd8cd4 ("scsi: core: sysfs: Fix hang when device state is set via sysfs")
      Reviewed-by: Lee Duncan <lduncan@suse.com>
      Signed-off-by: Mike Christie <michael.christie@oracle.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      725ba128
    • Marta Plantykow's avatar
      ice: avoid bpf_prog refcount underflow · e65a8707
      Marta Plantykow authored
      [ Upstream commit f65ee535 ]
      
      Ice driver has the routines for managing XDP resources that are shared
      between ndo_bpf op and VSI rebuild flow. The latter takes place for
      example when user changes queue count on an interface via ethtool's
      set_channels().
      
      There is an issue around the bpf_prog refcounting when VSI is being
      rebuilt - since ice_prepare_xdp_rings() is called with vsi->xdp_prog as
      an argument that is used later on by ice_vsi_assign_bpf_prog(), same
      bpf_prog pointers are swapped with each other. Then it is also
      interpreted as an 'old_prog' which in turn causes us to call
      bpf_prog_put on it that will decrement its refcount.
      
      Below splat can be interpreted in a way that due to zero refcount of a
      bpf_prog it is wiped out from the system while kernel still tries to
      refer to it:
      
      [  481.069429] BUG: unable to handle page fault for address: ffffc9000640f038
      [  481.077390] #PF: supervisor read access in kernel mode
      [  481.083335] #PF: error_code(0x0000) - not-present page
      [  481.089276] PGD 100000067 P4D 100000067 PUD 1001cb067 PMD 106d2b067 PTE 0
      [  481.097141] Oops: 0000 [#1] PREEMPT SMP PTI
      [  481.101980] CPU: 12 PID: 3339 Comm: sudo Tainted: G           OE     5.15.0-rc5+ #1
      [  481.110840] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016
      [  481.122021] RIP: 0010:dev_xdp_prog_id+0x25/0x40
      [  481.127265] Code: 80 00 00 00 00 0f 1f 44 00 00 89 f6 48 c1 e6 04 48 01 fe 48 8b 86 98 08 00 00 48 85 c0 74 13 48 8b 50 18 31 c0 48 85 d2 74 07 <48> 8b 42 38 8b 40 20 c3 48 8b 96 90 08 00 00 eb e8 66 2e 0f 1f 84
      [  481.148991] RSP: 0018:ffffc90007b63868 EFLAGS: 00010286
      [  481.155034] RAX: 0000000000000000 RBX: ffff889080824000 RCX: 0000000000000000
      [  481.163278] RDX: ffffc9000640f000 RSI: ffff889080824010 RDI: ffff889080824000
      [  481.171527] RBP: ffff888107af7d00 R08: 0000000000000000 R09: ffff88810db5f6e0
      [  481.179776] R10: 0000000000000000 R11: ffff8890885b9988 R12: ffff88810db5f4bc
      [  481.188026] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      [  481.196276] FS:  00007f5466d5bec0(0000) GS:ffff88903fb00000(0000) knlGS:0000000000000000
      [  481.205633] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  481.212279] CR2: ffffc9000640f038 CR3: 000000014429c006 CR4: 00000000003706e0
      [  481.220530] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  481.228771] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  481.237029] Call Trace:
      [  481.239856]  rtnl_fill_ifinfo+0x768/0x12e0
      [  481.244602]  rtnl_dump_ifinfo+0x525/0x650
      [  481.249246]  ? __alloc_skb+0xa5/0x280
      [  481.253484]  netlink_dump+0x168/0x3c0
      [  481.257725]  netlink_recvmsg+0x21e/0x3e0
      [  481.262263]  ____sys_recvmsg+0x87/0x170
      [  481.266707]  ? __might_fault+0x20/0x30
      [  481.271046]  ? _copy_from_user+0x66/0xa0
      [  481.275591]  ? iovec_from_user+0xf6/0x1c0
      [  481.280226]  ___sys_recvmsg+0x82/0x100
      [  481.284566]  ? sock_sendmsg+0x5e/0x60
      [  481.288791]  ? __sys_sendto+0xee/0x150
      [  481.293129]  __sys_recvmsg+0x56/0xa0
      [  481.297267]  do_syscall_64+0x3b/0xc0
      [  481.301395]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [  481.307238] RIP: 0033:0x7f5466f39617
      [  481.311373] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb bd 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2f 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
      [  481.342944] RSP: 002b:00007ffedc7f4308 EFLAGS: 00000246 ORIG_RAX: 000000000000002f
      [  481.361783] RAX: ffffffffffffffda RBX: 00007ffedc7f5460 RCX: 00007f5466f39617
      [  481.380278] RDX: 0000000000000000 RSI: 00007ffedc7f5360 RDI: 0000000000000003
      [  481.398500] RBP: 00007ffedc7f53f0 R08: 0000000000000000 R09: 000055d556f04d50
      [  481.416463] R10: 0000000000000077 R11: 0000000000000246 R12: 00007ffedc7f5360
      [  481.434131] R13: 00007ffedc7f5350 R14: 00007ffedc7f5344 R15: 0000000000000e98
      [  481.451520] Modules linked in: ice(OE) af_packet binfmt_misc nls_iso8859_1 ipmi_ssif intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp mxm_wmi mei_me coretemp mei ipmi_si ipmi_msghandler wmi acpi_pad acpi_power_meter ip_tables x_tables autofs4 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel ahci crypto_simd cryptd libahci lpc_ich [last unloaded: ice]
      [  481.528558] CR2: ffffc9000640f038
      [  481.542041] ---[ end trace d1f24c9ecf5b61c1 ]---
      
      Fix this by only calling ice_vsi_assign_bpf_prog() inside
      ice_prepare_xdp_rings() when current vsi->xdp_prog pointer is NULL.
      This way set_channels() flow will not attempt to swap the vsi->xdp_prog
      pointers with itself.
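
      The refcount hazard and the guard can be modeled in a few lines.
      This is a toy model with hypothetical names (struct prog, prog_put(),
      vsi_assign_prog(), prepare_xdp_rings()), not the ice driver code; it
      only shows why assigning a program over itself drops a reference
      that nobody took.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a refcounted program object. */
struct prog {
    int refcnt;
};

static void prog_put(struct prog *p)
{
    if (p)
        p->refcnt--;
}

struct vsi {
    struct prog *xdp_prog;
};

/* Install p and drop the reference that was held on the old program. */
static void vsi_assign_prog(struct vsi *v, struct prog *p)
{
    struct prog *old = v->xdp_prog;

    v->xdp_prog = p;
    prog_put(old);   /* if old == p, this drops a reference nobody took */
}

/* Rebuild path: only assign when no program is installed yet. */
static void prepare_xdp_rings(struct vsi *v, struct prog *p)
{
    if (!v->xdp_prog)
        vsi_assign_prog(v, p);
}
```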
      
      Also, sprinkle around some comments that provide a reasoning about
      correlation between driver and kernel in terms of bpf_prog refcount.
      
      Fixes: efc2214b ("ice: Add support for XDP")
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Signed-off-by: Marta Plantykow <marta.a.plantykow@intel.com>
      Co-developed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e65a8707
    • Maciej Fijalkowski's avatar
      ice: fix vsi->txq_map sizing · 1eb5395a
      Maciej Fijalkowski authored
      [ Upstream commit 792b2086 ]
      
      The approach of having an XDP queue per CPU regardless of the user's
      setting exposed a hidden bug that could occur when the Rx queue count
      differs from the Tx queue count. Currently vsi->txq_map's size is equal
      to double vsi->alloc_txq, which is not correct because XDP rings were
      previously based on the Rx queue count. The splat below can be seen
      when ethtool -L is used and XDP rings are configured:
      
      [  682.875339] BUG: kernel NULL pointer dereference, address: 000000000000000f
      [  682.883403] #PF: supervisor read access in kernel mode
      [  682.889345] #PF: error_code(0x0000) - not-present page
      [  682.895289] PGD 0 P4D 0
      [  682.898218] Oops: 0000 [#1] PREEMPT SMP PTI
      [  682.903055] CPU: 42 PID: 2878 Comm: ethtool Tainted: G           OE     5.15.0-rc5+ #1
      [  682.912214] Hardware name: Intel Corp. GRANTLEY/GRANTLEY, BIOS GRRFCRB1.86B.0276.D07.1605190235 05/19/2016
      [  682.923380] RIP: 0010:devres_remove+0x44/0x130
      [  682.928527] Code: 49 89 f4 55 48 89 fd 4c 89 ff 53 48 83 ec 10 e8 92 b9 49 00 48 8b 9d a8 02 00 00 48 8d 8d a0 02 00 00 49 89 c2 48 39 cb 74 0f <4c> 3b 63 10 74 25 48 8b 5b 08 48 39 cb 75 f1 4c 89 ff 4c 89 d6 e8
      [  682.950237] RSP: 0018:ffffc90006a679f0 EFLAGS: 00010002
      [  682.956285] RAX: 0000000000000286 RBX: ffffffffffffffff RCX: ffff88908343a370
      [  682.964538] RDX: 0000000000000001 RSI: ffffffff81690d60 RDI: 0000000000000000
      [  682.972789] RBP: ffff88908343a0d0 R08: 0000000000000000 R09: 0000000000000000
      [  682.981040] R10: 0000000000000286 R11: 3fffffffffffffff R12: ffffffff81690d60
      [  682.989282] R13: ffffffff81690a00 R14: ffff8890819807a8 R15: ffff88908343a36c
      [  682.997535] FS:  00007f08c7bfa740(0000) GS:ffff88a03fd00000(0000) knlGS:0000000000000000
      [  683.006910] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  683.013557] CR2: 000000000000000f CR3: 0000001080a66003 CR4: 00000000003706e0
      [  683.021819] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  683.030075] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  683.038336] Call Trace:
      [  683.041167]  devm_kfree+0x33/0x50
      [  683.045004]  ice_vsi_free_arrays+0x5e/0xc0 [ice]
      [  683.050380]  ice_vsi_rebuild+0x4c8/0x750 [ice]
      [  683.055543]  ice_vsi_recfg_qs+0x9a/0x110 [ice]
      [  683.060697]  ice_set_channels+0x14f/0x290 [ice]
      [  683.065962]  ethnl_set_channels+0x333/0x3f0
      [  683.070807]  genl_family_rcv_msg_doit+0xea/0x150
      [  683.076152]  genl_rcv_msg+0xde/0x1d0
      [  683.080289]  ? channels_prepare_data+0x60/0x60
      [  683.085432]  ? genl_get_cmd+0xd0/0xd0
      [  683.089667]  netlink_rcv_skb+0x50/0xf0
      [  683.094006]  genl_rcv+0x24/0x40
      [  683.097638]  netlink_unicast+0x239/0x340
      [  683.102177]  netlink_sendmsg+0x22e/0x470
      [  683.106717]  sock_sendmsg+0x5e/0x60
      [  683.110756]  __sys_sendto+0xee/0x150
      [  683.114894]  ? handle_mm_fault+0xd0/0x2a0
      [  683.119535]  ? do_user_addr_fault+0x1f3/0x690
      [  683.134173]  __x64_sys_sendto+0x25/0x30
      [  683.148231]  do_syscall_64+0x3b/0xc0
      [  683.161992]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fix this by taking into account the value that num_possible_cpus()
      yields in addition to vsi->alloc_txq instead of doubling the latter.
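
      The sizing difference can be illustrated with plain arithmetic. The
      helper names below are hypothetical and the CPU count is passed in
      as a parameter rather than read from num_possible_cpus(), so this is
      only a sketch of the calculation, not the driver change itself.

```c
#include <assert.h>

/* Sizing before the fix, as described above: double the Tx queue count. */
static unsigned int txq_map_len_old(unsigned int alloc_txq)
{
    return 2u * alloc_txq;
}

/* Sizing after the fix: Tx queues plus one XDP ring per possible CPU. */
static unsigned int txq_map_len_new(unsigned int alloc_txq,
                                    unsigned int possible_cpus)
{
    return alloc_txq + possible_cpus;
}

/* Does a map of map_len entries hold the Tx queues and the XDP rings? */
static int map_fits(unsigned int map_len, unsigned int alloc_txq,
                    unsigned int xdp_rings)
{
    return map_len >= alloc_txq + xdp_rings;
}
```

      For example, with 4 Tx queues and 16 possible CPUs the doubled map
      has 8 entries while 20 are needed.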
      
      Fixes: efc2214b ("ice: Add support for XDP")
      Fixes: 22bf877e ("ice: introduce XDP_TX fallback path")
      Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Tested-by: Kiran Bhandare <kiranx.bhandare@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      1eb5395a
    • Nikolay Aleksandrov's avatar
      net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group · 26ed13d0
      Nikolay Aleksandrov authored
      [ Upstream commit 1005f19b ]
      
      When replacing a nexthop group, we must release the IPv6 per-cpu dsts of
      the removed nexthop entries after an RCU grace period because they
      contain references to the nexthop's net device and to the fib6 info.
      With specific series of events[1] we can reach net device refcount
      imbalance which is unrecoverable. IPv4 is not affected because dsts
      don't take a refcount on the route.
      
      [1]
       $ ip nexthop list
        id 200 via 2002:db8::2 dev bridge.10 scope link onlink
        id 201 via 2002:db8::3 dev bridge scope link onlink
        id 203 group 201/200
       $ ip -6 route
        2001:db8::10 nhid 203 metric 1024 pref medium
           nexthop via 2002:db8::3 dev bridge weight 1 onlink
           nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink
      
      Create rt6_info through one of the multipath legs, e.g.:
       $ taskset -a -c 1  ./pkt_inj 24 bridge.10 2001:db8::10
       (pkt_inj is just a custom packet generator, nothing special)
      
      Then remove that leg from the group by replace (let's assume it is id
      200 in this case):
       $ ip nexthop replace id 203 group 201
      
      Now remove the IPv6 route:
       $ ip -6 route del 2001:db8::10/128
      
      The route won't be really deleted due to the stale rt6_info holding 1
      refcnt in nexthop id 200.
      At this point we have the following reference count dependency:
       (deleted) IPv6 route holds 1 reference over nhid 203
       nh 203 holds 1 ref over id 201
       nh 200 holds 1 ref over the net device and the route due to the stale
       rt6_info
      
      Now to create circular dependency between nh 200 and the IPv6 route, and
      also to get a reference over nh 200, restore nhid 200 in the group:
       $ ip nexthop replace id 203 group 201/200
      
      And now we have a permanent circular dependency because nhid 203 holds a
      reference over nh 200 and 201, but the route holds a ref over nh 203 and
      is deleted.
      
      To trigger the bug just delete the group (nhid 203):
       $ ip nexthop del id 203
      
      It won't really be deleted due to the IPv6 route dependency, and now we
      have 2 unlinked and deleted objects that reference each other: the group
      and the IPv6 route. Since the group drops the reference it holds over its
      entries at free time (i.e. its own refcount needs to drop to 0) that will
      never happen and we get a permanent ref on them, since one of the entries
      holds a reference over the IPv6 route it will also never be released.
      
      At this point the dependencies are:
       (deleted, only unlinked) IPv6 route holds reference over group nh 203
       (deleted, only unlinked) group nh 203 holds reference over nh 201 and 200
       nh 200 holds 1 ref over the net device and the route due to the stale
       rt6_info
      
      This is the last point where it can be fixed by running traffic through
      nh 200, and specifically through the same CPU so the rt6_info (dst) will
      get released due to the IPv6 genid, that in turn will free the IPv6
      route, which in turn will free the ref count over the group nh 203.
      
      If nh 200 is deleted at this point, it will never be released due to the
      ref from the unlinked group 203, it will only be unlinked:
       $ ip nexthop del id 200
       $ ip nexthop
       $
      
      Now we can never release that stale rt6_info, we have IPv6 route with ref
      over group nh 203, group nh 203 with ref over nh 200 and 201, nh 200 with
      rt6_info (dst) with ref over the net device and the IPv6 route. All of
      these objects are only unlinked, and cannot be released, thus they can't
      release their ref counts.
      
       Message from syslogd@dev at Nov 19 14:04:10 ...
        kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
       Message from syslogd@dev at Nov 19 14:04:20 ...
        kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
      
      Fixes: 7bf4796d ("nexthops: add support for replace")
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      26ed13d0
    • Nikolay Aleksandrov's avatar
      net: ipv6: add fib6_nh_release_dsts stub · 3c405845
      Nikolay Aleksandrov authored
      [ Upstream commit 8837cbbf ]
      
      We need a way to release a fib6_nh's per-cpu dsts when replacing
      nexthops otherwise we can end up with stale per-cpu dsts which hold net
      device references, so add a new IPv6 stub called fib6_nh_release_dsts.
      It must be used after an RCU grace period, so no new dsts can be created
      through a group's nexthop entry.
      Similar to fib6_nh_release it shouldn't be used if fib6_nh_init has failed
      so it doesn't need a dummy stub when IPv6 is not enabled.
      
      Fixes: 7bf4796d ("nexthops: add support for replace")
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: stmmac: retain PTP clock time during SIOCSHWTSTAMP ioctls · dc2f7e9d
      Holger Assmann authored
      [ Upstream commit a6da2bbb ]
      
      Currently, when user space emits SIOCSHWTSTAMP ioctl calls such as
      enabling/disabling timestamping or changing filter settings, the driver
      reads the current CLOCK_REALTIME value and programs it into the
      NIC's hardware clock. This might be necessary during system
      initialization, but at runtime, when the PTP clock has already been
      synchronized to a grandmaster, a reset of the timestamp settings might
      result in a clock jump. Furthermore, if the clock is also controlled by
      phc2sys in automatic mode (where the UTC offset is queried from ptp4l),
      that UTC-to-TAI offset (currently 37 seconds in 2021) would be
      temporarily reset to 0, and it would take a long time for phc2sys to
      readjust so that CLOCK_REALTIME and the PHC are apart by 37 seconds
      again.
      
      To address the issue, we introduce a new function called
      stmmac_init_tstamp_counter(), which gets called during ndo_open().
      It contains the code snippet moved from stmmac_hwtstamp_set() that
      manages the time synchronization. The sub-second increment
      configuration is also moved here, since the related values are
      hardware-dependent and runtime-invariant.
      
      Furthermore, the hardware clock must be kept running even when no time
      stamping mode is selected in order to retain the synchronized time base.
      That way, timestamping can be re-enabled at any time, and only the
      clock's natural drift needs to be compensated.
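
      The behavioural difference can be sketched with a toy model. This is a
      minimal Python illustration, not driver code: ToyNic and its methods are
      hypothetical names that only mirror the roles of ndo_open() and the
      SIOCSHWTSTAMP handler, the clock is reduced to a single offset from
      system time, and the value 37 is the UTC-to-TAI offset quoted above.

```python
TAI_OFFSET_S = 37  # UTC-to-TAI offset in 2021, as stated in the commit message


class ToyNic:
    """Toy NIC whose hardware clock is tracked as an offset from system time."""

    def __init__(self, reprogram_on_ioctl):
        # True models the old behaviour, False the behaviour after the patch.
        self.reprogram_on_ioctl = reprogram_on_ioctl
        self.phc_offset_s = None   # hardware clock minus CLOCK_REALTIME
        self.timestamping = False

    def open(self):
        # Interface open: seed the hardware clock from system time once.
        self.phc_offset_s = 0

    def step_clock(self, delta_s):
        # A synchronization daemon steps the hardware clock.
        self.phc_offset_s += delta_s

    def hwtstamp_set(self, enable):
        # Timestamping configuration change requested by user space.
        self.timestamping = enable
        if self.reprogram_on_ioctl:
            # Old behaviour: reload system time, discarding the synced offset.
            self.phc_offset_s = 0


def offset_after_reconfig(reprogram_on_ioctl):
    """Sync the clock to TAI, toggle timestamping, return the remaining offset."""
    nic = ToyNic(reprogram_on_ioctl)
    nic.open()
    nic.step_clock(TAI_OFFSET_S)
    nic.hwtstamp_set(False)
    nic.hwtstamp_set(True)
    return nic.phc_offset_s
```

      With reprogram_on_ioctl=True the synchronized 37-second offset is lost
      after the configuration change; with False it survives the toggle.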
      
      As a side effect, this patch fixes the issue that ptp_clock_info::enable
      can be called before SIOCSHWTSTAMP and the driver (which looks at
      priv->systime_flags) was not prepared to handle that ordering.
      
      Fixes: 92ba6888 ("stmmac: add the support for PTP hw clock driver")
      Reported-by: Michael Olbrich <m.olbrich@pengutronix.de>
      Signed-off-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
      Signed-off-by: Holger Assmann <h.assmann@pengutronix.de>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>