  1. Apr 14, 2022
    • net: stmmac: Fix unset max_speed difference between DT and non-DT platforms · 47fec613
      Chen-Yu Tsai authored
      [ Upstream commit c21cabb0 ]
      
      In commit 9cbadf09 ("net: stmmac: support max-speed device tree
      property"), when DT platforms don't set "max-speed", max_speed is set to
      -1; for non-DT platforms, it stays the default 0.
      
      Prior to commit eeef2f6b ("net: stmmac: Start adding phylink support"),
      the check for a valid max_speed setting was to check if it was greater
      than zero. This commit got it right, but subsequent patches just checked
      for non-zero, which is incorrect for DT platforms.
      
      In commit 92c3807b ("net: stmmac: convert to phylink_get_linkmodes()")
      the conversion switched completely to checking for non-zero value as a
      valid value, which caused 1000base-T to stop getting advertised by
      default.
      
      Instead of trying to fix all the checks, simply leave max_speed alone if
      DT property parsing fails.
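
      A minimal sketch of that approach in plain C, not the driver code.
      read_u32_prop() is a hypothetical stand-in for a DT property reader that
      returns 0 on success and leaves its output untouched on failure;
      struct plat_data, parse_max_speed() and has_speed_limit() are likewise
      illustrative names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical platform data; 0 means "no speed limit configured". */
struct plat_data {
	int max_speed;
};

/*
 * Hypothetical stand-in for a device tree property reader: returns 0 and
 * stores the value on success, or a negative error and leaves *out
 * untouched when the property is absent.
 */
static int read_u32_prop(bool present, uint32_t value, uint32_t *out)
{
	if (!present)
		return -1;
	*out = value;
	return 0;
}

/* Only overwrite max_speed when the property was actually parsed. */
static void parse_max_speed(struct plat_data *plat, bool present, uint32_t value)
{
	uint32_t speed;

	if (read_u32_prop(present, value, &speed) == 0)
		plat->max_speed = (int)speed;
	/* On failure max_speed keeps its default of 0, as on non-DT platforms. */
}

/* With a shared default of 0, a non-zero test is enough for later checks. */
static bool has_speed_limit(const struct plat_data *plat)
{
	return plat->max_speed != 0;
}
```

      With this shape, a missing property and a non-DT platform both end up
      with max_speed == 0, so the later non-zero checks behave the same way.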
      
      Fixes: 9cbadf09 ("net: stmmac: support max-speed device tree property")
      Fixes: 92c3807b ("net: stmmac: convert to phylink_get_linkmodes()")
      Signed-off-by: Chen-Yu Tsai <wens@csie.org>
      Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
      Link: https://lore.kernel.org/r/20220331184832.16316-1-wens@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: ipv4: fix route with nexthop object delete warning · 907c9798
      Nikolay Aleksandrov authored
      [ Upstream commit 6bf92d70 ]
      
      FRR folks have hit a kernel warning[1] while deleting routes[2] which is
      caused by trying to delete a route pointing to a nexthop id without
      specifying nhid but matching on an interface. That is, a route is found
      but we hit a warning while matching it. The warning is from
      fib_info_nh() in include/net/nexthop.h because we run it on a fib_info
      with nexthop object. The call chain is:
       inet_rtm_delroute -> fib_table_delete -> fib_nh_match (called with a
      nexthop fib_info and also with fc_oif set thus calling fib_info_nh on
      the fib_info and triggering the warning). The fix is to not do any
      matching in that branch if the fi has a nexthop object because those are
      managed separately. I.e. we should match when deleting without nh spec and
      should fail when deleting a nexthop route with old-style nh spec because
      nexthop objects are managed separately, e.g.:
       $ ip r show 1.2.3.4/32
       1.2.3.4 nhid 12 via 192.168.11.2 dev dummy0
      
       $ ip r del 1.2.3.4/32
       $ ip r del 1.2.3.4/32 nhid 12
       <both should work>
      
       $ ip r del 1.2.3.4/32 dev dummy0
       <should fail with ESRCH>
      
      [1]
       [  523.462226] ------------[ cut here ]------------
       [  523.462230] WARNING: CPU: 14 PID: 22893 at include/net/nexthop.h:468 fib_nh_match+0x210/0x460
       [  523.462236] Modules linked in: dummy rpcsec_gss_krb5 xt_socket nf_socket_ipv4 nf_socket_ipv6 ip6table_raw iptable_raw bpf_preload xt_statistic ip_set ip_vs_sh ip_vs_wrr ip_vs_rr ip_vs xt_mark nf_tables xt_nat veth nf_conntrack_netlink nfnetlink xt_addrtype br_netfilter overlay dm_crypt nfsv3 nfs fscache netfs vhost_net vhost vhost_iotlb tap tun xt_CHECKSUM xt_MASQUERADE xt_conntrack 8021q garp mrp ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bridge stp llc rfcomm snd_seq_dummy snd_hrtimer rpcrdma rdma_cm iw_cm ib_cm ib_core ip6table_filter xt_comment ip6_tables vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) qrtr bnep binfmt_misc xfs vfat fat squashfs loop nvidia_drm(POE) nvidia_modeset(POE) nvidia_uvm(POE) nvidia(POE) intel_rapl_msr intel_rapl_common snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi btusb btrtl iwlmvm uvcvideo btbcm snd_hda_intel edac_mce_amd
       [  523.462274]  videobuf2_vmalloc videobuf2_memops btintel snd_intel_dspcfg videobuf2_v4l2 snd_intel_sdw_acpi bluetooth snd_usb_audio snd_hda_codec mac80211 snd_usbmidi_lib joydev snd_hda_core videobuf2_common kvm_amd snd_rawmidi snd_hwdep snd_seq videodev ccp snd_seq_device libarc4 ecdh_generic mc snd_pcm kvm iwlwifi snd_timer drm_kms_helper snd cfg80211 cec soundcore irqbypass rapl wmi_bmof i2c_piix4 rfkill k10temp pcspkr acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc drm zram ip_tables crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel nvme sp5100_tco r8169 nvme_core wmi ipmi_devintf ipmi_msghandler fuse
       [  523.462300] CPU: 14 PID: 22893 Comm: ip Tainted: P           OE     5.16.18-200.fc35.x86_64 #1
       [  523.462302] Hardware name: Micro-Star International Co., Ltd. MS-7C37/MPG X570 GAMING EDGE WIFI (MS-7C37), BIOS 1.C0 10/29/2020
       [  523.462303] RIP: 0010:fib_nh_match+0x210/0x460
       [  523.462304] Code: 7c 24 20 48 8b b5 90 00 00 00 e8 bb ee f4 ff 48 8b 7c 24 20 41 89 c4 e8 ee eb f4 ff 45 85 e4 0f 85 2e fe ff ff e9 4c ff ff ff <0f> 0b e9 17 ff ff ff 3c 0a 0f 85 61 fe ff ff 48 8b b5 98 00 00 00
       [  523.462306] RSP: 0018:ffffaa53d4d87928 EFLAGS: 00010286
       [  523.462307] RAX: 0000000000000000 RBX: ffffaa53d4d87a90 RCX: ffffaa53d4d87bb0
       [  523.462308] RDX: ffff9e3d2ee6be80 RSI: ffffaa53d4d87a90 RDI: ffffffff920ed380
       [  523.462309] RBP: ffff9e3d2ee6be80 R08: 0000000000000064 R09: 0000000000000000
       [  523.462310] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000031
       [  523.462310] R13: 0000000000000020 R14: 0000000000000000 R15: ffff9e3d331054e0
       [  523.462311] FS:  00007f245517c1c0(0000) GS:ffff9e492ed80000(0000) knlGS:0000000000000000
       [  523.462313] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [  523.462313] CR2: 000055e5dfdd8268 CR3: 00000003ef488000 CR4: 0000000000350ee0
       [  523.462315] Call Trace:
       [  523.462316]  <TASK>
       [  523.462320]  fib_table_delete+0x1a9/0x310
       [  523.462323]  inet_rtm_delroute+0x93/0x110
       [  523.462325]  rtnetlink_rcv_msg+0x133/0x370
       [  523.462327]  ? _copy_to_iter+0xb5/0x6f0
       [  523.462330]  ? rtnl_calcit.isra.0+0x110/0x110
       [  523.462331]  netlink_rcv_skb+0x50/0xf0
       [  523.462334]  netlink_unicast+0x211/0x330
       [  523.462336]  netlink_sendmsg+0x23f/0x480
       [  523.462338]  sock_sendmsg+0x5e/0x60
       [  523.462340]  ____sys_sendmsg+0x22c/0x270
       [  523.462341]  ? import_iovec+0x17/0x20
       [  523.462343]  ? sendmsg_copy_msghdr+0x59/0x90
       [  523.462344]  ? __mod_lruvec_page_state+0x85/0x110
       [  523.462348]  ___sys_sendmsg+0x81/0xc0
       [  523.462350]  ? netlink_seq_start+0x70/0x70
       [  523.462352]  ? __dentry_kill+0x13a/0x180
       [  523.462354]  ? __fput+0xff/0x250
       [  523.462356]  __sys_sendmsg+0x49/0x80
       [  523.462358]  do_syscall_64+0x3b/0x90
       [  523.462361]  entry_SYSCALL_64_after_hwframe+0x44/0xae
       [  523.462364] RIP: 0033:0x7f24552aa337
       [  523.462365] Code: 0e 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
       [  523.462366] RSP: 002b:00007fff7f05a838 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
       [  523.462368] RAX: ffffffffffffffda RBX: 000000006245bf91 RCX: 00007f24552aa337
       [  523.462368] RDX: 0000000000000000 RSI: 00007fff7f05a8a0 RDI: 0000000000000003
       [  523.462369] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
       [  523.462370] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000001
       [  523.462370] R13: 00007fff7f05ce08 R14: 0000000000000000 R15: 000055e5dfdd1040
       [  523.462373]  </TASK>
       [  523.462374] ---[ end trace ba537bc16f6bf4ed ]---
      
      [2] https://github.com/FRRouting/frr/issues/6412
      
      Fixes: 4c7e8084 ("ipv4: Plumb support for nexthop object in a fib_info")
      Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mctp: Fix check for dev_hard_header() result · 71d28e50
      Matt Johnston authored
      [ Upstream commit 60be976a ]
      
      dev_hard_header() returns the length of the header, so
      we need to test for negative errors rather than non-zero.
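
      The difference between the two checks can be sketched in standalone C.
      build_header() is a hypothetical stand-in for a call that, like
      dev_hard_header() as described above, returns the header length on
      success; the error value and helper names are assumptions for
      illustration:

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for a header-building call: returns the header
 * length (>= 0) on success or a negative errno-style value on failure.
 */
static int build_header(bool fail, int hdr_len)
{
	return fail ? -22 : hdr_len;
}

/* Buggy pattern: any non-zero return, including a valid length, is an error. */
static bool failed_nonzero_check(int rc)
{
	return rc != 0;
}

/* Corrected pattern: only negative values indicate failure. */
static bool failed_negative_check(int rc)
{
	return rc < 0;
}
```

      With a successful call returning a positive length, the non-zero check
      reports a failure while the negative check does not.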
      
      Fixes: 889b7da2 ("mctp: Add initial routing framework")
      Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • ice: Clear default forwarding VSI during VSI release · 40229b29
      Ivan Vecera authored
      [ Upstream commit bd8c624c ]
      
      A VSI is set as the default forwarding VSI when promisc mode is set
      for the PF interface, when the PF is switched to switchdev mode, or
      when a VF driver asks to enable allmulticast or promisc mode for the
      VF interface (when the vf-true-promisc-support priv flag is off).
      The third case is buggy because the VSI associated with the VF
      remains the default one after VF removal.
      
      Reproducer:
      1. Create VF
         echo 1 > /sys/class/net/ens7f0/device/sriov_numvfs
      2. Enable allmulticast or promisc mode on VF
         ip link set ens7f0v0 allmulticast on
         ip link set ens7f0v0 promisc on
      3. Delete VF
         echo 0 > /sys/class/net/ens7f0/device/sriov_numvfs
      4. Try to enable promisc mode on PF
         ip link set ens7f0 promisc on
      
      Although promisc mode appears to be enabled on the PF, it is not:
      ice_vsi_sync_fltr(), which is responsible for IFF_PROMISC handling,
      first checks whether any other VSI is set as the default forwarding
      VSI and, if so, does nothing. At this point it is not possible to
      enable promisc mode on the PF without re-probing the device.
      
      To resolve the issue, this patch clears the default forwarding VSI
      during ice_vsi_release() when the VSI being released is the default
      one.
      
      Fixes: 01b5e89a ("ice: Add VF promiscuous support")
      Signed-off-by: Ivan Vecera <ivecera@redhat.com>
      Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
      Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Signed-off-by: Alice Michael <alice.michael@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • skbuff: fix coalescing for page_pool fragment recycling · ba965e86
      Jean-Philippe Brucker authored
      [ Upstream commit 1effe8ca ]
      
      Fix a use-after-free when using page_pool with page fragments. We
      encountered this problem during normal RX in the hns3 driver:
      
      (1) Initially we have three descriptors in the RX queue. The first one
          allocates PAGE1 through page_pool, and the other two allocate one
          half of PAGE2 each. Page references look like this:
      
                      RX_BD1 _______ PAGE1
                      RX_BD2 _______ PAGE2
                      RX_BD3 _________/
      
      (2) Handle RX on the first descriptor. Allocate SKB1, eventually added
          to the receive queue by tcp_queue_rcv().
      
      (3) Handle RX on the second descriptor. Allocate SKB2 and pass it to
          netif_receive_skb():
      
          netif_receive_skb(SKB2)
            ip_rcv(SKB2)
              SKB3 = skb_clone(SKB2)
      
          SKB2 and SKB3 share a reference to PAGE2 through
          skb_shinfo()->dataref. The other ref to PAGE2 is still held by
          RX_BD3:
      
                            SKB2 ---+- PAGE2
                            SKB3 __/   /
                      RX_BD3 _________/
      
       (3b) Now while handling TCP, coalesce SKB3 with SKB1:
      
            tcp_v4_rcv(SKB3)
              tcp_try_coalesce(to=SKB1, from=SKB3)    // succeeds
              kfree_skb_partial(SKB3)
                skb_release_data(SKB3)                // drops one dataref
      
                            SKB1 _____ PAGE1
                                 \____
                            SKB2 _____ PAGE2
                                       /
                      RX_BD3 _________/
      
          In skb_try_coalesce(), __skb_frag_ref() takes a page reference to
          PAGE2, where it should instead have increased the page_pool frag
          reference, pp_frag_count. Without coalescing, when releasing both
          SKB2 and SKB3, a single reference to PAGE2 would be dropped. Now
          when releasing SKB1 and SKB2, two references to PAGE2 will be
          dropped, resulting in underflow.
      
       (3c) Drop SKB2:
      
            af_packet_rcv(SKB2)
              consume_skb(SKB2)
                skb_release_data(SKB2)                // drops second dataref
                  page_pool_return_skb_page(PAGE2)    // drops one pp_frag_count
      
                            SKB1 _____ PAGE1
                                 \____
                                       PAGE2
                                       /
                      RX_BD3 _________/
      
      (4) Userspace calls recvmsg()
          Copies SKB1 and releases it. Since SKB3 was coalesced with SKB1, we
          release the SKB3 page as well:
      
          tcp_eat_recv_skb(SKB1)
            skb_release_data(SKB1)
              page_pool_return_skb_page(PAGE1)
              page_pool_return_skb_page(PAGE2)        // drops second pp_frag_count
      
      (5) PAGE2 is freed, but the third RX descriptor was still using it!
          In our case this causes IOMMU faults, but it would silently corrupt
          memory if the IOMMU was disabled.
      
      Change the logic that checks whether pp_recycle SKBs can be coalesced.
      We still reject differing pp_recycle between 'from' and 'to' SKBs, but
      in order to avoid the situation described above, we also reject
      coalescing when both 'from' and 'to' are pp_recycled and 'from' is
      cloned.
      
      The new logic allows coalescing a cloned pp_recycle SKB into a page
      refcounted one, because in this case the release (4) will drop the right
      reference, the one taken by skb_try_coalesce().
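
      The resulting rule can be sketched as a small truth-table predicate.
      This is a standalone illustration of the behaviour described above, not
      the kernel's skb_try_coalesce(); the function name and its boolean
      parameters are assumptions standing in for to->pp_recycle,
      from->pp_recycle and the cloned state of 'from':

```c
#include <stdbool.h>

/*
 * Illustrative predicate only: decides whether 'from' may be coalesced
 * into 'to' based on the page_pool recycling flags and clone state.
 */
static bool can_coalesce(bool to_pp_recycle, bool from_pp_recycle,
			 bool from_cloned)
{
	/*
	 * A cloned pp_recycle 'from' is treated as page refcounted, so it
	 * may only be merged into a 'to' that is not pp_recycle.
	 */
	bool from_effective_pp = from_pp_recycle && !from_cloned;

	return to_pp_recycle == from_effective_pp;
}
```

      Under these assumptions, two pp_recycle SKBs coalesce only when 'from'
      is not cloned, and a cloned pp_recycle 'from' is accepted only by a
      page refcounted 'to'.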
      
      Fixes: 53e0961d ("page_pool: add frag page recycling support in page pool")
      Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
      Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
      Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
      Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • vrf: fix packet sniffing for traffic originating from ip tunnels · 13bcc6f8
      Eyal Birger authored
      [ Upstream commit 012d69fb ]
      
      In commit 04893908
      ("vrf: add mac header for tunneled packets when sniffer is attached")
      an Ethernet header was cooked for traffic originating from tunnel devices.
      
      However, the header is added based on whether the mac_header is unset
      and ignores cases where the device doesn't expose a mac header to upper
      layers, such as in ip tunnels like ipip and gre.
      
      Traffic originating from such devices still appears garbled when capturing
      on the vrf device.
      
      Fix by observing whether the original device exposes a header to upper
      layers, similar to the logic done in af_packet.
      
      In addition, skb->mac_len needs to be adjusted after adding the Ethernet
      header for the skb_push/pull() surrounding dev_queue_xmit_nit() to work
      on these packets.
      
      Fixes: 04893908 ("vrf: add mac header for tunneled packets when sniffer is attached")
      Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net/tls: fix slab-out-of-bounds bug in decrypt_internal · 6e2f1b03
      Ziyang Xuan authored
      [ Upstream commit 9381fe8c ]
      
      tls_set_sw_offload() allocates 12 bytes for tls_ctx->rx.iv with
      AES128-CCM, but crypto_aead_ivsize() returns 16 for "ccm(aes)". The
      memcpy() therefore reads 16 bytes from a 12-byte buffer, which triggers
      the following slab-out-of-bounds bug:
      
      ==================================================================
      BUG: KASAN: slab-out-of-bounds in decrypt_internal+0x385/0xc40 [tls]
      Read of size 16 at addr ffff888114e84e60 by task tls/10911
      
      Call Trace:
       <TASK>
       dump_stack_lvl+0x34/0x44
       print_report.cold+0x5e/0x5db
       ? decrypt_internal+0x385/0xc40 [tls]
       kasan_report+0xab/0x120
       ? decrypt_internal+0x385/0xc40 [tls]
       kasan_check_range+0xf9/0x1e0
       memcpy+0x20/0x60
       decrypt_internal+0x385/0xc40 [tls]
       ? tls_get_rec+0x2e0/0x2e0 [tls]
       ? process_rx_list+0x1a5/0x420 [tls]
       ? tls_setup_from_iter.constprop.0+0x2e0/0x2e0 [tls]
       decrypt_skb_update+0x9d/0x400 [tls]
       tls_sw_recvmsg+0x3c8/0xb50 [tls]
      
      Allocated by task 10911:
       kasan_save_stack+0x1e/0x40
       __kasan_kmalloc+0x81/0xa0
       tls_set_sw_offload+0x2eb/0xa20 [tls]
       tls_setsockopt+0x68c/0x700 [tls]
       __sys_setsockopt+0xfe/0x1b0
      
      Replace crypto_aead_ivsize() with prot->iv_size + prot->salt_size
      as the memcpy() length for the IV in the TLS_1_3_VERSION case.
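
      A minimal sketch of the corrected copy length in standalone C. The
      struct, the function name and the 8 + 4 byte split of the 12-byte
      buffer are assumptions for illustration; only the 12-byte source size
      and the 16-byte AEAD IV size come from the description above:

```c
#include <stddef.h>
#include <string.h>

/* Sizes assumed for this sketch: 8-byte IV plus 4-byte salt = 12 bytes. */
enum {
	SKETCH_IV_SIZE = 8,
	SKETCH_SALT_SIZE = 4,
	SKETCH_AEAD_IVSIZE = 16
};

/* Hypothetical subset of the protocol info holding the two sizes. */
struct sketch_prot {
	size_t iv_size;
	size_t salt_size;
};

/*
 * Copy only as many bytes as the source IV buffer actually holds,
 * instead of the larger AEAD IV size. Returns the number of bytes copied.
 */
static size_t copy_iv(unsigned char *dst, const unsigned char *src_iv,
		      const struct sketch_prot *prot)
{
	size_t len = prot->iv_size + prot->salt_size;

	memcpy(dst, src_iv, len);
	return len;
}
```

      The destination may still be sized for the full AEAD IV; the point is
      that the read from the source never exceeds what was allocated for it.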
      
      Fixes: f295b3ae ("net/tls: Add support of AES128-CCM based ciphers")
      Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: sfc: add missing xdp queue reinitialization · ed7a824f
      Taehee Yoo authored
      [ Upstream commit 059a47f1 ]
      
      After the rx/tx ring buffer size is changed, a kernel panic occurs when
      the driver performs XDP_TX or XDP_REDIRECT.
      
      When the tx/rx ring buffer size is changed (ethtool -G), the sfc driver
      reallocates and reinitializes the rx and tx queues and their buffers
      (tx_queue->buffer).
      But it misses reinitializing the xdp queues (efx->xdp_tx_queues).
      So, while performing XDP_TX or XDP_REDIRECT, it uses the uninitialized
      tx_queue->buffer.
      
      A new function efx_set_xdp_channels() is split out of efx_set_channels()
      to handle only the xdp queues.
      
      Splat looks like:
         BUG: kernel NULL pointer dereference, address: 000000000000002a
         #PF: supervisor write access in kernel mode
         #PF: error_code(0x0002) - not-present page
         PGD 0 P4D 0
         Oops: 0002 [#4] PREEMPT SMP NOPTI
         RIP: 0010:efx_tx_map_chunk+0x54/0x90 [sfc]
         CPU: 2 PID: 0 Comm: swapper/2 Tainted: G      D           5.17.0+ #55 e8beeee8289528f11357029357cf
         Code: 48 8b 8d a8 01 00 00 48 8d 14 52 4c 8d 2c d0 44 89 e0 48 85 c9 74 0e 44 89 e2 4c 89 f6 48 80
         RSP: 0018:ffff92f121e45c60 EFLAGS: 00010297
         RIP: 0010:efx_tx_map_chunk+0x54/0x90 [sfc]
         RAX: 0000000000000040 RBX: ffff92ea506895c0 RCX: ffffffffc0330870
         RDX: 0000000000000001 RSI: 00000001139b10ce RDI: ffff92ea506895c0
         RBP: ffffffffc0358a80 R08: 00000001139b110d R09: 0000000000000000
         R10: 0000000000000001 R11: ffff92ea414c0088 R12: 0000000000000040
         R13: 0000000000000018 R14: 00000001139b10ce R15: ffff92ea506895c0
         FS:  0000000000000000(0000) GS:ffff92f121ec0000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         Code: 48 8b 8d a8 01 00 00 48 8d 14 52 4c 8d 2c d0 44 89 e0 48 85 c9 74 0e 44 89 e2 4c 89 f6 48 80
         CR2: 000000000000002a CR3: 00000003e6810004 CR4: 00000000007706e0
         RSP: 0018:ffff92f121e85c60 EFLAGS: 00010297
         PKRU: 55555554
         RAX: 0000000000000040 RBX: ffff92ea50689700 RCX: ffffffffc0330870
         RDX: 0000000000000001 RSI: 00000001145a90ce RDI: ffff92ea50689700
         RBP: ffffffffc0358a80 R08: 00000001145a910d R09: 0000000000000000
         R10: 0000000000000001 R11: ffff92ea414c0088 R12: 0000000000000040
         R13: 0000000000000018 R14: 00000001145a90ce R15: ffff92ea50689700
         FS:  0000000000000000(0000) GS:ffff92f121e80000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 000000000000002a CR3: 00000003e6810005 CR4: 00000000007706e0
         PKRU: 55555554
         Call Trace:
          <IRQ>
          efx_xdp_tx_buffers+0x12b/0x3d0 [sfc 84c94b8e32d44d296c17e10a634d3ad454de4ba5]
          __efx_rx_packet+0x5c3/0x930 [sfc 84c94b8e32d44d296c17e10a634d3ad454de4ba5]
          efx_rx_packet+0x28c/0x2e0 [sfc 84c94b8e32d44d296c17e10a634d3ad454de4ba5]
          efx_ef10_ev_process+0x5f8/0xf40 [sfc 84c94b8e32d44d296c17e10a634d3ad454de4ba5]
          ? enqueue_task_fair+0x95/0x550
          efx_poll+0xc4/0x360 [sfc 84c94b8e32d44d296c17e10a634d3ad454de4ba5]
      
      Fixes: 3990a8ff ("sfc: allocate channels for XDP tx queues")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • vdpa: mlx5: prevent cvq work from hogging CPU · 69ec350a
      Jason Wang authored
      [ Upstream commit 55ebf0d6 ]
      
      A userspace triggerable infinite loop could happen in
      mlx5_cvq_kick_handler() if userspace keeps sending a huge amount of
      cvq requests.
      
      Fix this by introducing a quota and re-queueing the work when the
      budget is exhausted (currently the implicit budget is one). While at
      it, use a per-device work struct to avoid on-demand memory allocation
      for cvq.
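
      The budget idea can be sketched in standalone C. struct sketch_cvq and
      kick_handler() are hypothetical names, and the "re-queue" is modelled
      as a counter plus a return value rather than a real workqueue call:

```c
#include <stdbool.h>

/* Hypothetical control queue: pending request count and re-queue counter. */
struct sketch_cvq {
	int pending;
	int requeues;
};

/*
 * Handle at most 'budget' requests per invocation. Returns true if the
 * handler ran out of budget with requests still pending, meaning the
 * work should be queued again instead of looping further.
 */
static bool kick_handler(struct sketch_cvq *cvq, int budget)
{
	int handled = 0;

	while (cvq->pending > 0) {
		if (handled == budget) {
			cvq->requeues++;
			return true;	/* yield; the rest runs in a later invocation */
		}
		cvq->pending--;	/* "process" one request */
		handled++;
	}
	return false;
}
```

      With a budget of one, each invocation handles a single request and
      yields, so a flood of requests cannot keep one invocation running
      indefinitely.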
      
      Fixes: 5262912e ("vdpa/mlx5: Add support for control VQ and MAC setting")
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/r/20220329042109.4029-1-jasowang@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Eli Cohen <elic@nvidia.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • vdpa/mlx5: Propagate link status from device to vdpa driver · 893c70f8
      Eli Cohen authored
      [ Upstream commit edf747af ]
      
      Add code to register to hardware asynchronous events. Use this
      mechanism to track link status events coming from the device and update
      the config struct.
      
      After doing link status change, call the vdpa callback to notify of the
      link status change.
      
      Signed-off-by: Eli Cohen <elic@nvidia.com>
      Link: https://lore.kernel.org/r/20210909123635.30884-4-elic@nvidia.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • vdpa/mlx5: Rename control VQ workqueue to vdpa wq · dc872b72
      Eli Cohen authored
      [ Upstream commit 218bdd20 ]
      
      A subsequent patch will use the same workqueue for executing other
      work not related to control VQ. Rename the workqueue and the work queue
      entry used to convey information to the workqueue.
      
      Signed-off-by: Eli Cohen <elic@nvidia.com>
      Link: https://lore.kernel.org/r/20210909123635.30884-3-elic@nvidia.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • scsi: zorro7xx: Fix a resource leak in zorro7xx_remove_one() · aefd755a
      Christophe JAILLET authored
      [ Upstream commit 16ed828b ]
      
      The error handling path of the probe releases a resource that is not freed
      in the remove function. In some cases, an ioremap() must be undone.
      
      Add the missing iounmap() call in the remove function.
      
      Link: https://lore.kernel.org/r/247066a3104d25f9a05de8b3270fc3c848763bcc.1647673264.git.christophe.jaillet@wanadoo.fr
      Fixes: 45804fbb ("[SCSI] 53c700: Amiga Zorro NCR53c710 SCSI")
      Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • scsi: core: Fix sbitmap depth in scsi_realloc_sdev_budget_map() · cd483e17
      John Garry authored
      [ Upstream commit eaba83b5 ]
      
      In commit edb854a3 ("scsi: core: Reallocate device's budget map on
      queue depth change"), the sbitmap for the device budget map may be
      reallocated after the slave device depth is configured.
      
      When the sbitmap is reallocated we use the result from
      scsi_device_max_queue_depth() for the sbitmap size, but don't resize to
      match the actual device queue depth.
      
      Fix by resizing the sbitmap after reallocating the budget sbitmap. We do
      this instead of init'ing the sbitmap to the device queue depth as the user
      may want to change the queue depth later via sysfs or other.
      
      Link: https://lore.kernel.org/r/1647423870-143867-1-git-send-email-john.garry@huawei.com
      Fixes: edb854a3 ("scsi: core: Reallocate device's budget map on queue depth change")
      Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: John Garry <john.garry@huawei.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • scsi: sr: Fix typo in CDROM(CLOSETRAY|EJECT) handling · 0610371c
      Kevin Groeneveld authored
      [ Upstream commit bc5519c1 ]
      
      Commit 2e27f576 ("scsi: scsi_ioctl: Call scsi_cmd_ioctl() from
      scsi_ioctl()") seems to have a typo as it is checking ret instead of cmd in
      the if statement checking for CDROMCLOSETRAY and CDROMEJECT.  This changes
      the behaviour of these ioctls as the cdrom_ioctl handling of these is more
      restrictive than the scsi_ioctl version.
      
      Link: https://lore.kernel.org/r/20220323002242.21157-1-kgroeneveld@lenbrook.com
      Fixes: 2e27f576 ("scsi: scsi_ioctl: Call scsi_cmd_ioctl() from scsi_ioctl()")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Kevin Groeneveld <kgroeneveld@lenbrook.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • NFSv4: fix open failure with O_ACCMODE flag · 6f52d4cd
      ChenXiaoSong authored
      [ Upstream commit b243874f ]
      
      A second open() with O_ACCMODE|O_DIRECT flags fails.
      
      Reproducer:
        1. mount -t nfs -o vers=4.2 $server_ip:/ /mnt/
        2. fd = open("/mnt/file", O_ACCMODE|O_DIRECT|O_CREAT)
        3. close(fd)
        4. fd = open("/mnt/file", O_ACCMODE|O_DIRECT)
      
      On the server, nfsd4_decode_share_access() fails with error
      nfserr_bad_xdr when the client uses an incorrect share access mode of 0.
      
      Fix this by using the NFS4_SHARE_ACCESS_BOTH share access mode in the
      client, just as on the first open.
      
      Fixes: ce4ef7c0 ("NFS: Split out NFS v4 file operations")
      Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Revert "NFSv4: Handle the special Linux file open access mode" · 9f0c2174
      ChenXiaoSong authored
      [ Upstream commit ab0fc21b ]
      
      This reverts commit 44942b4e.
      
      After opening a file a second time with O_ACCMODE|O_DIRECT flags,
      nfs4_valid_open_stateid() dereferences a NULL nfs4_state on lseek().
      
      Reproducer:
        1. mount -t nfs -o vers=4.2 $server_ip:/ /mnt/
        2. fd = open("/mnt/file", O_ACCMODE|O_DIRECT|O_CREAT)
        3. close(fd)
        4. fd = open("/mnt/file", O_ACCMODE|O_DIRECT)
        5. lseek(fd)
      
      Reported-by: Lyu Tao <tao.lyu@epfl.ch>
      Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Drivers: hv: vmbus: Fix potential crash on module unload · dcd6b1a6
      Guilherme G. Piccoli authored
      [ Upstream commit 792f232d ]
      
      The vmbus driver relies on the panic notifier infrastructure to perform
      some operations when a panic event is detected. Since vmbus can be built
      as module, it is required that the driver handles both registering and
      unregistering such panic notifier callback.
      
      After commit 74347a99 ("x86/Hyper-V: Unload vmbus channel in hv panic callback")
      though, the panic notifier registration is done unconditionally in the module
      initialization routine whereas the unregistering procedure is conditionally
      guarded and executes only if HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE capability
      is set.
      
      This patch fixes that by unconditionally unregistering the panic notifier
      in the module's exit routine as well.
      
      Fixes: 74347a99 ("x86/Hyper-V: Unload vmbus channel in hv panic callback")
      Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
      Reviewed-by: Michael Kelley <mikelley@microsoft.com>
      Link: https://lore.kernel.org/r/20220315203535.682306-1-gpiccoli@igalia.com
      Signed-off-by: Wei Liu <wei.liu@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Dan Carpenter's avatar
      drm/amdgpu: fix off by one in amdgpu_gfx_kiq_acquire() · 5ba9d78a
      Dan Carpenter authored
      [ Upstream commit 1647b54e ]
      
      This post-op should be a pre-op so that we do not pass -1 as the bit
      number to test_bit().  The current code will loop downwards from 63 to
      -1.  After changing to a pre-op, it loops from 63 to 0.
      
      Fixes: 71c37505 ("drm/amdgpu/gfx: move more common KIQ code to amdgpu_gfx.c")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      5ba9d78a
    • Mateusz Jończyk's avatar
      rtc: mc146818-lib: fix RTC presence check · 985d87e6
      Mateusz Jończyk authored
      [ Upstream commit ea6fa496 ]
      
      To prevent an infinite loop in mc146818_get_time(),
      commit 211e5db1 ("rtc: mc146818: Detect and handle broken RTCs")
      added a check for RTC availability. Together with a later fix, it
      checked if bit 6 in register 0x0d is cleared.
      
      This, however, caused a false negative on a motherboard with an AMD
      SB710 southbridge; according to the specification [1], bit 6 of register
      0x0d of this chipset is a scratchbit. This caused a regression in Linux
      5.11 - the RTC was determined broken by the kernel and not used by
      rtc-cmos.c [3]. This problem was also reported in Fedora [4].
      
      As a better alternative, check whether the UIP ("Update-in-progress")
      bit is set for longer than 10ms. If that is the case, then apparently
      the RTC is either absent (and all register reads return 0xff) or broken.
      Also limit the number of loop iterations in mc146818_get_time() to 10 to
      prevent an infinite loop there.
      
      The functions mc146818_get_time() and mc146818_does_rtc_work() will be
      refactored later in this patch series, in order to fix a separate
      problem with reading / setting the RTC alarm time. This is done to
      avoid confusion about what is being fixed when.
      
      In a previous approach to this problem, I implemented a check whether
      the RTC_HOURS register contains a value <= 24. This, however, sometimes
      did not work correctly on my Intel Kaby Lake laptop. According to
      Intel's documentation [2], "the time and date RAM locations (0-9) are
      disconnected from the external bus" during the update cycle so reading
      this register without checking the UIP bit is incorrect.
      
      [1] AMD SB700/710/750 Register Reference Guide, page 308,
      https://developer.amd.com/wordpress/media/2012/10/43009_sb7xx_rrg_pub_1.00.pdf
      
      [2] 7th Generation Intel ® Processor Family I/O for U/Y Platforms [...] Datasheet
      Volume 1 of 2, page 209
      Intel's Document Number: 334658-006,
      https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/7th-and-8th-gen-core-family-mobile-u-y-processor-lines-i-o-datasheet-vol-1.pdf
      
      [3] Functions in arch/x86/kernel/rtc.c apparently were using it.
      
      [4] https://bugzilla.redhat.com/show_bug.cgi?id=1936688
      
      Fixes: 211e5db1 ("rtc: mc146818: Detect and handle broken RTCs")
      Fixes: ebb22a05 ("rtc: mc146818: Dont test for bit 0-5 in Register D")
      Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Link: https://lore.kernel.org/r/20211210200131.153887-5-mat.jonczyk@o2.pl
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      985d87e6
    • Mateusz Jończyk's avatar
      rtc: Check return value from mc146818_get_time() · be6c3152
      Mateusz Jończyk authored
      [ Upstream commit 0dd8d6cb ]
      
      There are 4 users of mc146818_get_time() and none of them was checking
      the return value from this function. Change this.
      
      Print the appropriate warnings in callers of mc146818_get_time() instead
      of in the function mc146818_get_time() itself, in order not to add
      strings to rtc-mc146818-lib.c, which is kind of a library.
      
      The callers of alpha_rtc_read_time() and cmos_read_time() may use the
      contents of (struct rtc_time *) even when the functions return a failure
      code. Therefore, set the contents of (struct rtc_time *) to 0x00,
      which looks more sensible than 0xff and aligns with the (possibly
      stale?) comment in cmos_read_time:
      
      	/*
      	 * If pm_trace abused the RTC for storage, set the timespec to 0,
      	 * which tells the caller that this RTC value is unusable.
      	 */
      
      For consistency, do this in mc146818_get_time().
      
      Note: hpet_rtc_interrupt() may call mc146818_get_time() many times a
      second. It is very unlikely, though, that the RTC suddenly stops
      working and mc146818_get_time() would consistently fail.
      
      Only compile-tested on alpha.
      
      Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Cc: linux-alpha@vger.kernel.org
      Cc: x86@kernel.org
      Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Link: https://lore.kernel.org/r/20211210200131.153887-4-mat.jonczyk@o2.pl
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      be6c3152
    • Mateusz Jończyk's avatar
      rtc: mc146818-lib: change return values of mc146818_get_time() · 8c692107
      Mateusz Jończyk authored
      [ Upstream commit d35786b3 ]
      
      No function is checking mc146818_get_time() return values yet, so
      correct them to make them more customary.
      
      Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Link: https://lore.kernel.org/r/20211210200131.153887-3-mat.jonczyk@o2.pl
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      8c692107
    • Mauricio Faria de Oliveira's avatar
      mm: fix race between MADV_FREE reclaim and blkdev direct IO read · c9f50e06
      Mauricio Faria de Oliveira authored
      commit 6c8e2a25 upstream.
      
      Problem:
      =======
      
      Userspace might read the zero-page instead of actual data from a direct IO
      read on a block device if the buffers have been called madvise(MADV_FREE)
      on earlier (this is discussed below) due to a race between page reclaim on
      MADV_FREE and blkdev direct IO read.
      
      - Race condition:
        ==============
      
      During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks
      if the page is not dirty, then discards its rmap PTE(s) (vs.  remap back
      if the page is dirty).
      
      However, after try_to_unmap_one() returns to shrink_page_list(), it might
      keep the page _anyway_ if page_ref_freeze() fails (it expects exactly
      _one_ page reference, from the isolation for page reclaim).
      
      Well, blkdev_direct_IO() gets references for all pages, and on READ
      operations it only sets them dirty _later_.
      
      So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct
      IO read from block devices, and page reclaim happens during
      __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
      returns, but BEFORE the pages are set dirty, the situation happens.
      
      The direct IO read eventually completes.  Now, when userspace reads the
      buffers, the PTE is no longer there and the page fault handler
      do_anonymous_page() services that with the zero-page, NOT the data!
      
      A synthetic reproducer is provided.
      
      - Page faults:
        ===========
      
      If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't
      happen, because that faults-in all pages as writeable, so
      do_anonymous_page() sets up a new page/rmap/PTE, and that is used by
      direct IO.  The userspace reads don't fault as the PTE is there (thus
      zero-page is not used/setup).
      
      But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE
      is no longer there; the subsequent page faults can't help:
      
      The data-read from the block device probably won't generate faults due to
      DMA (no MMU) but even in the case it wouldn't use DMA, that happens on
      different virtual addresses (not user-mapped addresses) because `struct
      bio_vec` stores `struct page` to figure addresses out (which are different
      from user-mapped addresses) for the read.
      
      Thus userspace reads (to user-mapped addresses) still fault, then
      do_anonymous_page() gets another `struct page` that would address/ map to
      other memory than the `struct page` used by `struct bio_vec` for the read.
      (The original `struct page` is not available, since it wasn't freed, as
      page_ref_freeze() failed due to more page refs.  And even if it were
      available, its data cannot be trusted anymore.)
      
      Solution:
      ========
      
      One solution is to check for the expected page reference count in
      try_to_unmap_one().
      
      There should be one reference from the isolation (that is also checked in
      shrink_page_list() with page_ref_freeze()) plus one or more references
      from page mapping(s) (put in discard: label).  Further references mean
      that rmap/PTE cannot be unmapped/nuked.
      
      (Note: there might be more than one reference from mapping due to
      fork()/clone() without CLONE_VM, which use the same `struct page` for
      references, until the copy-on-write page gets copied.)
      
      So, additional page references (e.g., from direct IO read) now prevent the
      rmap/PTE from being unmapped/dropped, similarly to how the page is not
      freed per shrink_page_list()/page_ref_freeze().
      
      - Races and Barriers:
        ==================
      
      The new check in try_to_unmap_one() should be safe in races with
      bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's
      done under the PTE lock.
      
      The fast path doesn't take the lock, but it checks if the PTE has changed
      and if so, it drops the reference and leaves the page for the slow path
      (which does take that lock).
      
      The fast path requires synchronization w/ full memory barrier: it writes
      the page reference count first then it reads the PTE later, while
      try_to_unmap() writes PTE first then it reads page refcount.
      
      And a second barrier is needed, as the page dirty flag should not be read
      before the page reference count (as in __remove_mapping()).  (This can be
      a load memory barrier only; no writes are involved.)
      
      Call stack/comments:
      
      - try_to_unmap_one()
        - page_vma_mapped_walk()
          - map_pte()			# see pte_offset_map_lock():
              pte_offset_map()
              spin_lock()
      
        - ptep_get_and_clear()	# write PTE
        - smp_mb()			# (new barrier) GUP fast path
        - page_ref_count()		# (new check) read refcount
      
        - page_vma_mapped_walk_done()	# see pte_unmap_unlock():
            pte_unmap()
            spin_unlock()
      
      - bio_iov_iter_get_pages()
        - __bio_iov_iter_get_pages()
          - iov_iter_get_pages()
            - get_user_pages_fast()
              - internal_get_user_pages_fast()
      
                # fast path
                - lockless_pages_from_mm()
                  - gup_{pgd,p4d,pud,pmd,pte}_range()
                      ptep = pte_offset_map()		# not _lock()
                      pte = ptep_get_lockless(ptep)
      
                      page = pte_page(pte)
                      try_grab_compound_head(page)	# inc refcount
                                                  	# (RMW/barrier
                                                   	#  on success)
      
                      if (pte_val(pte) != pte_val(*ptep)) # read PTE
                              put_compound_head(page) # dec refcount
                              			# go slow path
      
                # slow path
                - __gup_longterm_unlocked()
                  - get_user_pages_unlocked()
                    - __get_user_pages_locked()
                      - __get_user_pages()
                        - follow_{page,p4d,pud,pmd}_mask()
                          - follow_page_pte()
                              ptep = pte_offset_map_lock()
                              pte = *ptep
                              page = vm_normal_page(pte)
                              try_grab_page(page)	# inc refcount
                              pte_unmap_unlock()
      
      - Huge Pages:
        ==========
      
      Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE
      (aka lazyfree) pages are PageAnon() && !PageSwapBacked()
      (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn())
      thus should reach shrink_page_list() -> split_huge_page_to_list() before
      try_to_unmap[_one](), so it deals with normal pages only.
      
      (And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens,
      which should not or be rare, the page refcount should be greater than
      mapcount: the head page is referenced by tail pages.  That also prevents
      checking the head `page` then incorrectly call page_remove_rmap(subpage)
      for a tail page, that isn't even in the shrink_page_list()'s page_list (an
      effect of split huge pmd/pmvw), as it might happen today in this unlikely
      scenario.)
      
      MADV_FREE'd buffers:
      ===================
      
      So, back to the "if MADV_FREE pages are used as buffers" note.  The case
      is arguable, and subject to multiple interpretations.
      
      The madvise(2) manual page on the MADV_FREE advice value says:
      
      1) 'After a successful MADV_FREE ... data will be lost when
         the kernel frees the pages.'
      2) 'the free operation will be canceled if the caller writes
         into the page' / 'subsequent writes ... will succeed and
         then [the] kernel cannot free those dirtied pages'
      3) 'If there is no subsequent write, the kernel can free the
         pages at any time.'
      
      Thoughts, questions, considerations... respectively:
      
      1) Since the kernel didn't actually free the page (page_ref_freeze()
         failed), should the data not have been lost? (on userspace read.)
      2) Should writes performed by the direct IO read be able to cancel
         the free operation?
         - Should the direct IO read be considered as 'the caller' too,
           as it's been requested by 'the caller'?
         - Should the bio technique to dirty pages on return to userspace
           (bio_check_pages_dirty() is called/used by __blkdev_direct_IO())
           be considered in another/special way here?
      3) Should an upcoming write from a previously requested direct IO
         read be considered as a subsequent write, so the kernel should
         not free the pages? (as it's known at the time of page reclaim.)
      
      And lastly:
      
      Technically, the last point would seem a reasonable consideration and
      balance, as the madvise(2) manual page apparently (and fairly) seems to
      assume that 'writes' are memory accesses from the userspace process (not
      explicitly considering writes from the kernel or its corner cases; again,
      fairly). Plus, the kernel fix implementation for the corner case of the
      largely 'non-atomic write' encompassed by a direct IO read operation is
      relatively simple; and it helps.
      
      Reproducer:
      ==========
      
      @ test.c (simplified, but works)
      
      	#define _GNU_SOURCE
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      
      	int main() {
      		int fd, i;
      		char *buf;
      
      		fd = open(DEV, O_RDONLY | O_DIRECT);
      
      		buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                      	   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      
      		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
      			buf[i] = 1; // init to non-zero
      
      		madvise(buf, BUF_SIZE, MADV_FREE);
      
      		read(fd, buf, BUF_SIZE);
      
      		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
      			printf("%p: 0x%x\n", &buf[i], buf[i]);
      
      		return 0;
      	}
      
      @ block/fops.c (formerly fs/block_dev.c)
      
      	+#include <linux/swap.h>
      	...
      	... __blkdev_direct_IO[_simple](...)
      	{
      	...
      	+	if (!strcmp(current->comm, "good"))
      	+		shrink_all_memory(ULONG_MAX);
      	+
               	ret = bio_iov_iter_get_pages(...);
      	+
      	+	if (!strcmp(current->comm, "bad"))
      	+		shrink_all_memory(ULONG_MAX);
      	...
      	}
      
      @ shell
      
              # NUM_PAGES=4
              # PAGE_SIZE=$(getconf PAGE_SIZE)
      
              # yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES}
              # DEV=$(losetup -f --show test.img)
      
              # gcc -DDEV=\"$DEV\" \
                    -DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \
                    -DPAGE_SIZE=${PAGE_SIZE} \
                     test.c -o test
      
              # od -tx1 $DEV
              0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a
              *
              0040000
      
              # mv test good
              # ./good
              0x7f7c10418000: 0x79
              0x7f7c10419000: 0x79
              0x7f7c1041a000: 0x79
              0x7f7c1041b000: 0x79
      
              # mv good bad
              # ./bad
              0x7fa1b8050000: 0x0
              0x7fa1b8051000: 0x0
              0x7fa1b8052000: 0x0
              0x7fa1b8053000: 0x0
      
      Note: the issue is consistent on v5.17-rc3, but it's intermittent with the
      support of MADV_FREE on v4.5 (60%-70% error; needs swap).  [wrap
      do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c].
      
      - v5.17-rc3:
      
              # for i in {1..1000}; do ./good; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
              # mv good bad
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x0
      
              # free | grep Swap
              Swap:             0           0           0
      
      - v4.5:
      
              # for i in {1..1000}; do ./good; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
              # mv good bad
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 2702  0x0
                 1298  0x79
      
              # swapoff -av
              swapoff /swap
      
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
      Ceph/TCMalloc:
      =============
      
      For documentation purposes, the use case driving the analysis/fix is Ceph
      on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to
      release unused memory to the system from the mmap'ed page heap (might be
      committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan()
      -> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() ->
      TCMalloc_SystemCommit() -> do nothing.
      
      Note: TCMalloc switched back to MADV_DONTNEED a few commits after the
      release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue
      just 'disappeared' on Ceph on later Ubuntu releases but is still present
      in the kernel, and can be hit by other use cases.
      
      The observed issue seems to be the old Ceph bug #22464 [1], where checksum
      mismatches are observed (and instrumentation with buffer dumps shows
      zero-pages read from mmap'ed/MADV_FREE'd page ranges).
      
      The issue in Ceph was reasonably deemed a kernel bug (comment #50) and
      mostly worked around with a retry mechanism, but other parts of Ceph could
      still hit that (rocksdb).  Anyway, it's less likely to be hit again as
      TCMalloc switched out of MADV_FREE by default.
      
      (Some kernel versions/reports from the Ceph bug, and relation with
      the MADV_FREE introduction/changes; TCMalloc versions not checked.)
      - 4.4 good
      - 4.5 (madv_free: introduction)
      - 4.9 bad
      - 4.10 good? maybe a swapless system
      - 4.12 (madv_free: no longer free instantly on swapless systems)
      - 4.13 bad
      
      [1] https://tracker.ceph.com/issues/22464
      
      Thanks:
      ======
      
      Several people contributed to analysis/discussions/tests/reproducers in
      the first stages when drilling down on ceph/tcmalloc/linux kernel:
      
      - Dan Hill
      - Dan Streetman
      - Dongdong Tao
      - Gavin Guo
      - Gerald Yang
      - Heitor Alves de Siqueira
      - Ioanna Alifieraki
      - Jay Vosburgh
      - Matthew Ruffell
      - Ponnuvel Palaniyappan
      
      Reviews, suggestions, corrections, comments:
      
      - Minchan Kim
      - Yu Zhao
      - Huang, Ying
      - John Hubbard
      - Christoph Hellwig
      
      [mfo@canonical.com: v4]
        Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.com
      Link: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com
      
      Fixes: 802a3a92 ("mm: reclaim MADV_FREE pages")
      Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Dan Hill <daniel.hill@canonical.com>
      Cc: Dan Streetman <dan.streetman@canonical.com>
      Cc: Dongdong Tao <dongdong.tao@canonical.com>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Gerald Yang <gerald.yang@canonical.com>
      Cc: Heitor Alves de Siqueira <halves@canonical.com>
      Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Matthew Ruffell <matthew.ruffell@canonical.com>
      Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com>
      Cc: <stable@vger.kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [mfo: backport: replace folio/test_flag with page/flag equivalents;
       real Fixes: 854e9ed0 ("mm: support madvise(MADV_FREE)") in v4.]
      Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c9f50e06
    • John David Anglin's avatar
      parisc: Fix patch code locking and flushing · 93a8347f
      John David Anglin authored
      [ Upstream commit a9fe7fa7 ]
      
      This change fixes the following:
      
      1) The flags variable is not initialized. Always use raw_spin_lock_irqsave
      and raw_spin_unlock_irqrestore to serialize patching.
      
      2) flush_kernel_vmap_range is primarily intended for DMA flushes. Since
      __patch_text_multiple is often called with interrupts disabled, it is
      better to directly call flush_kernel_dcache_range_asm and
      flush_kernel_icache_range_asm. This avoids an extra call.
      
      3) The final call to flush_icache_range is unnecessary.
      
      Signed-off-by: John David Anglin <dave.anglin@bell.net>
      Signed-off-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      93a8347f
    • Helge Deller's avatar
      parisc: Fix CPU affinity for Lasi, WAX and Dino chips · f77f482e
      Helge Deller authored
      [ Upstream commit 939fc856 ]
      
      Add the missing logic to allow Lasi, WAX and Dino to set the
      CPU affinity. This fixes IRQ migration to other CPUs when a
      CPU which currently holds the IRQs for one of those
      chips is shut down.
      
      Signed-off-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f77f482e
    • Naresh Kamboju's avatar
      selftests: net: Add tls config dependency for tls selftests · 30dd4af4
      Naresh Kamboju authored
      [ Upstream commit d9142e1c ]
      
      The selftest net tls test cases need TLS=m; without it, the test hangs.
      Enabling config TLS solves this problem and lets the tests run to completion.
        - CONFIG_TLS=m
      
      Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
      Signed-off-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      30dd4af4
    • Trond Myklebust's avatar
      NFS: Avoid writeback threads getting stuck in mempool_alloc() · ea029e4c
      Trond Myklebust authored
      [ Upstream commit 0bae835b ]
      
      In a low memory situation, allow the NFS writeback code to fail without
      getting stuck in infinite loops in mempool_alloc().
      
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ea029e4c
    • Trond Myklebust's avatar
      NFS: nfsiod should not block forever in mempool_alloc() · da747de6
      Trond Myklebust authored
      [ Upstream commit 515dcdcd ]
      
      The concern is that since nfsiod is sometimes required to kick off a
      commit, it can get locked up waiting forever in mempool_alloc() instead
      of failing gracefully and leaving the commit until later.
      
      Try to allocate from the slab first, with GFP_KERNEL | __GFP_NORETRY,
      then fall back to a non-blocking attempt to allocate from the memory
      pool.
      
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      da747de6
    • Trond Myklebust's avatar
      SUNRPC: Fix socket waits for write buffer space · e04ef859
      Trond Myklebust authored
      [ Upstream commit 7496b59f ]
      
      The socket layer requires that we use the socket lock to protect changes
      to the sock->sk_write_pending field and others.
      
      Reported-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e04ef859
    • Haimin Zhang's avatar
      jfs: prevent NULL deref in diFree · d925b7e7
      Haimin Zhang authored
      [ Upstream commit a5304629 ]
      
      Add a validation check for JFS_IP(ipimap)->i_imap to prevent a NULL deref
      in diFree, since diFree uses it without doing any validation.
      When jfs_mount calls diMount to initialize the fileset inode
      allocation map, diMount can fail and leave JFS_IP(ipimap)->i_imap
      uninitialized. jfs_mount then calls diFreeSpecial to close the fileset
      inode allocation map inode, which flows into jfs_evict_inode.
      jfs_evict_inode only validates JFS_SBI(inode->i_sb)->ipimap before
      calling diFree. diFree uses JFS_IP(ipimap)->i_imap directly, which
      causes a NULL deref.
      
      Reported-by: TCS Robot <tcs_robot@tencent.com>
      Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
      Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d925b7e7
    • Randy Dunlap's avatar
      virtio_console: eliminate anonymous module_init & module_exit · 44c2d5fb
      Randy Dunlap authored
      [ Upstream commit fefb8a2a ]
      
      Eliminate anonymous module_init() and module_exit(), which can lead to
      confusion or ambiguity when reading System.map, crashes/oops/bugs,
      or an initcall_debug log.
      
      Give each of these init and exit functions unique driver-specific
      names to eliminate the anonymous names.
      
      Example 1: (System.map)
       ffffffff832fc78c t init
       ffffffff832fc79e t init
       ffffffff832fc8f8 t init
      
      Example 2: (initcall_debug log)
       calling  init+0x0/0x12 @ 1
       initcall init+0x0/0x12 returned 0 after 15 usecs
       calling  init+0x0/0x60 @ 1
       initcall init+0x0/0x60 returned 0 after 2 usecs
       calling  init+0x0/0x9a @ 1
       initcall init+0x0/0x9a returned 0 after 74 usecs
      
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reviewed-by: Amit Shah <amit@kernel.org>
      Cc: virtualization@lists.linux-foundation.org
      Cc: Arnd Bergmann <arnd@arndb.de>
      Link: https://lore.kernel.org/r/20220316192010.19001-3-rdunlap@infradead.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      44c2d5fb
    • Jiri Slaby's avatar
      serial: samsung_tty: do not unlock port->lock for uart_write_wakeup() · 053bbff8
      Jiri Slaby authored
      [ Upstream commit 988c7c00 ]
      
      The commit c15c3747 (serial: samsung: fix potential soft lockup
      during uart write) added an unlock of port->lock before
      uart_write_wakeup() and a lock after it. It was always problematic to
      write data from tty_ldisc_ops::write_wakeup and it was even documented
      that way. We fixed the line disciplines to conform to this recently.
      So if there is still a missed one, we should fix them instead of this
      workaround.
      
      On top of that, s3c24xx_serial_tx_dma_complete() in this driver
      still holds the port->lock while calling uart_write_wakeup().
      
      So revert the wrap added by the commit above.
      
      Cc: Thomas Abraham <thomas.abraham@linaro.org>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Hyeonkook Kim <hk619.kim@samsung.com>
      Signed-off-by: default avatarJiri Slaby <jslaby@suse.cz>
      Link: https://lore.kernel.org/r/20220308115153.4225-1-jslaby@suse.cz
      
      
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      053bbff8
    • Nathan Chancellor's avatar
      x86/Kconfig: Do not allow CONFIG_X86_X32_ABI=y with llvm-objcopy · c393a9f4
      Nathan Chancellor authored
      [ Upstream commit aaeed6ec ]
      
      There are two outstanding issues with CONFIG_X86_X32_ABI and
      llvm-objcopy, with similar root causes:
      
      1. llvm-objcopy does not properly convert .note.gnu.property when going
         from x86_64 to x86_x32, resulting in a corrupted section when
         linking:
      
         https://github.com/ClangBuiltLinux/linux/issues/1141
      
      2. llvm-objcopy produces corrupted compressed debug sections when going
         from x86_64 to x86_x32, also resulting in an error when linking:
      
         https://github.com/ClangBuiltLinux/linux/issues/514
      
      
      
      After commit 41c5ef31ad71 ("x86/ibt: Base IBT bits"), the
      .note.gnu.property section is always generated when
      CONFIG_X86_KERNEL_IBT is enabled, which causes the first issue to become
      visible with an allmodconfig build:
      
        ld.lld: error: arch/x86/entry/vdso/vclock_gettime-x32.o:(.note.gnu.property+0x1c): program property is too short
      
      To avoid this error, do not allow CONFIG_X86_X32_ABI to be selected when
      using llvm-objcopy. If the two issues ever get fixed in llvm-objcopy,
      this can be turned into a feature check.
      
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20220314194842.3452-3-nathan@kernel.org
      
      
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c393a9f4
    • Peter Zijlstra's avatar
      x86: Annotate call_on_stack() · e3c961c5
      Peter Zijlstra authored
      [ Upstream commit be007595 ]
      
      vmlinux.o: warning: objtool: page_fault_oops()+0x13c: unreachable instruction
      
      0000 000000000005b460 <page_fault_oops>:
      ...
      0128    5b588:  49 89 23                mov    %rsp,(%r11)
      012b    5b58b:  4c 89 dc                mov    %r11,%rsp
      012e    5b58e:  4c 89 f2                mov    %r14,%rdx
      0131    5b591:  48 89 ee                mov    %rbp,%rsi
      0134    5b594:  4c 89 e7                mov    %r12,%rdi
      0137    5b597:  e8 00 00 00 00          call   5b59c <page_fault_oops+0x13c>    5b598: R_X86_64_PLT32   handle_stack_overflow-0x4
      013c    5b59c:  5c                      pop    %rsp
      
      vmlinux.o: warning: objtool: sysvec_reboot()+0x6d: unreachable instruction
      
      0000 00000000000033f0 <sysvec_reboot>:
      ...
      005d     344d:  4c 89 dc                mov    %r11,%rsp
      0060     3450:  e8 00 00 00 00          call   3455 <sysvec_reboot+0x65>        3451: R_X86_64_PLT32    irq_enter_rcu-0x4
      0065     3455:  48 89 ef                mov    %rbp,%rdi
      0068     3458:  e8 00 00 00 00          call   345d <sysvec_reboot+0x6d>        3459: R_X86_64_PC32     .text+0x47d0c
      006d     345d:  e8 00 00 00 00          call   3462 <sysvec_reboot+0x72>        345e: R_X86_64_PLT32    irq_exit_rcu-0x4
      0072     3462:  5c                      pop    %rsp
      
      Both cases are due to a call_on_stack() calling a __noreturn function.
      Since that's an inline asm, GCC can't do anything about the
      instructions after the CALL. Therefore put in an explicit
      ASM_REACHABLE annotation to make sure objtool and gcc are consistently
      confused about control flow.
      
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Link: https://lore.kernel.org/r/20220308154319.468805622@infradead.org
      
      
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e3c961c5
    • NeilBrown's avatar
      NFS: swap-out must always use STABLE writes. · 6bb22702
      NeilBrown authored
      [ Upstream commit c265de25 ]
      
      The commit handling code is not safe against memory-pressure deadlocks
      when writing to swap.  In particular, nfs_commitdata_alloc() blocks
      indefinitely waiting for memory, and this can consume all available
      workqueue threads.
      
      swap-out most likely uses STABLE writes anyway as COND_STABLE indicates
      that a stable write should be used if the write fits in a single
      request, and it normally does.  However if we ever swap with a small
      wsize, or gather unusually large numbers of pages for a single write,
      this might change.
      
      For safety, make it explicit in the code that direct writes used for swap
      must always use FLUSH_STABLE.
      
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6bb22702
    • NeilBrown's avatar
      NFS: swap IO handling is slightly different for O_DIRECT IO · 24d28d9b
      NeilBrown authored
      [ Upstream commit 64158668 ]
      
      1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding
         possible deadlocks with "fs_reclaim".  These deadlocks could, I believe,
         eventuate if a buffered read on the swapfile was attempted.
      
         We don't need coherence with the page cache for a swap file, and
         buffered writes are forbidden anyway.  There is no other need for
         i_rwsem during direct IO.  So never take it for swap_rw().
      
      2/ generic_write_checks() explicitly forbids writes to swap, and
         performs checks that are not needed for swap.  So bypass it
         for swap_rw().
      
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      24d28d9b
    • NeilBrown's avatar
      SUNRPC: remove scheduling boost for "SWAPPER" tasks. · a5538640
      NeilBrown authored
      [ Upstream commit a80a8461 ]
      
      Currently, tasks marked as "swapper" tasks get put to the front of
      non-priority rpc_queues, and are sorted earlier than non-swapper tasks on
      the transport's ->xmit_queue.
      
      This is pointless as currently *all* tasks for a mount that has swap
      enabled on *any* file are marked as "swapper" tasks.  So the net result
      is that the non-priority rpc_queues are reverse-ordered (LIFO).
      
      This scheduling boost is not necessary to avoid deadlocks, and hurts
      fairness, so remove it.  If there were a need to expedite some requests,
      the tk_priority mechanism is a more appropriate tool.
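
The reverse-ordering effect described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct task_queue`, `QUEUE_CAP` and `enqueue_with_boost()` are hypothetical names invented for the illustration and are not the kernel's rpc_wait_queue code.

```c
#include <stddef.h>
#include <string.h>

#define QUEUE_CAP 8

/* Hypothetical stand-in for a non-priority RPC wait queue. */
struct task_queue {
	int ids[QUEUE_CAP];
	size_t len;
};

/*
 * Model of the removed boost: a task flagged as "swapper" is placed at
 * the head of the queue, any other task at the tail.
 * Returns 0 on success, -1 if the queue is full.
 */
static int enqueue_with_boost(struct task_queue *q, int id, int is_swapper)
{
	if (q->len >= QUEUE_CAP)
		return -1;

	if (is_swapper) {
		/* Shift existing entries back by one and insert at the head. */
		memmove(&q->ids[1], &q->ids[0], q->len * sizeof(q->ids[0]));
		q->ids[0] = id;
	} else {
		q->ids[q->len] = id;
	}
	q->len++;
	return 0;
}
```

In this model, if every task is flagged as a swapper task, each new arrival jumps ahead of all earlier ones, so the queue is served newest-first (LIFO) and the flag no longer distinguishes anything.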
      
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a5538640
    • NeilBrown's avatar
      SUNRPC/xprt: async tasks mustn't block waiting for memory · 20700aa0
      NeilBrown authored
      [ Upstream commit a7210354 ]
      
      When memory is short, new worker threads cannot be created and we depend
      on the minimum one rpciod thread to be able to handle everything.  So it
      must not block waiting for memory.
      
      xprt_dynamic_alloc_slot can block indefinitely.  This can tie up all
      workqueue threads and NFS can deadlock.  So when called from a
      workqueue, set __GFP_NORETRY.
      
      The rdma alloc_slot already does not block.  However it sets the error
      to -EAGAIN suggesting this will trigger a sleep.  It does not.  As we
      can see in call_reserveresult(), only -ENOMEM causes a sleep.  -EAGAIN
      causes immediate retry.
      
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      20700aa0
    • NeilBrown's avatar
      SUNRPC/call_alloc: async tasks mustn't block waiting for memory · a19fd1d6
      NeilBrown authored
      [ Upstream commit c487216b ]
      
      When memory is short, new worker threads cannot be created and we depend
      on the minimum one rpciod thread to be able to handle everything.
      So it must not block waiting for memory.
      
      mempools are particularly a problem as memory can only be released back
      to the mempool by an async rpc task running.  If all available
      workqueue threads are waiting on the mempool, no thread is available to
      return anything.
      
      rpc_malloc() can block, and this might cause deadlocks.
      So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if
      blocking is acceptable.
      
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a19fd1d6
    • Maxime Ripard's avatar
      clk: Enforce that disjoints limits are invalid · b07387c4
      Maxime Ripard authored
      [ Upstream commit 10c46f2e ]
      
      If we were to have two users of the same clock, doing something like:
      
      clk_set_rate_range(user1, 1000, 2000);
      clk_set_rate_range(user2, 3000, 4000);
      
      The second call would fail with -EINVAL, preventing us from getting
      into a situation where we end up with impossible limits.
      
      However, this is never explicitly checked against and enforced, and
      works by relying on an undocumented behaviour of clk_set_rate().
      
      Indeed, the first clk_set_rate_range() call will make sure the current
      clock rate is within the new range, so it will be between 1000 and
      2000Hz. The second clk_set_rate_range() call will consider
      (rightfully) that our current clock is outside of the 3000-4000Hz
      range, and will call clk_core_set_rate_nolock() to set it to 3000Hz.

      clk_core_set_rate_nolock() will then call clk_calc_new_rates(), which
      will eventually find that our 3000Hz rate is outside the min 3000Hz,
      max 2000Hz range and bail out; the error will propagate and we'll
      eventually return -EINVAL.
      
      This solely relies on the fact that clk_calc_new_rates(), and in
      particular clk_core_determine_round_nolock(), won't modify the new rate
      allowing the error to be reported. That assumption won't be true for all
      drivers, and most importantly we'll break that assumption in a later
      patch.
      
      It can also be argued that we shouldn't even reach the point where we're
      calling clk_core_set_rate_nolock().
      
      Let's make an explicit check for disjoint ranges before we do
      anything.
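
As an illustration of what an up-front disjoint-range check can look like, here is a minimal user-space sketch. It is not the code from this patch: `struct rate_range` and `check_rate_range()` are hypothetical names, and the assumption is simply that a new range is acceptable only if it still intersects the ranges requested by all other users of the clock.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-user rate boundaries, in Hz. */
struct rate_range {
	unsigned long min;
	unsigned long max;
};

/*
 * Return 0 if [min, max] intersects the ranges of all other users,
 * -EINVAL if the request is malformed or the ranges are disjoint.
 */
static int check_rate_range(const struct rate_range *others, size_t n,
			    unsigned long min, unsigned long max)
{
	size_t i;

	if (min > max)
		return -EINVAL;

	/* Narrow the candidate range to the intersection with every user. */
	for (i = 0; i < n; i++) {
		if (others[i].min > min)
			min = others[i].min;
		if (others[i].max < max)
			max = others[i].max;
	}

	/* An empty intersection means the limits are impossible to satisfy. */
	return min > max ? -EINVAL : 0;
}
```

With one existing user at 1000-2000Hz, a request for 3000-4000Hz narrows to min 3000Hz, max 2000Hz and is rejected before any rate change is attempted, which is the outcome the message argues should be enforced explicitly.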
      
      Signed-off-by: Maxime Ripard <maxime@cerno.tech>
      Link: https://lore.kernel.org/r/20220225143534.405820-4-maxime@cerno.tech
      
      
      Signed-off-by: Stephen Boyd <sboyd@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b07387c4
    • Tony Lindgren's avatar
      clk: ti: Preserve node in ti_dt_clocks_register() · 15bfec9d
      Tony Lindgren authored
      [ Upstream commit 80864594 ]
      
      In preparation for making use of the clock-output-names, we want to
      keep node around in ti_dt_clocks_register().
      
      This change should not be needed as a fix currently.
      
      Signed-off-by: Tony Lindgren <tony@atomide.com>
      Link: https://lore.kernel.org/r/20220204071449.16762-3-tony@atomide.com
      
      
      Signed-off-by: Stephen Boyd <sboyd@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      15bfec9d