  1. Apr 14, 2022
    • vrf: fix packet sniffing for traffic originating from ip tunnels · 13bcc6f8
      Eyal Birger authored
      [ Upstream commit 012d69fb ]
      
      In commit 04893908
      ("vrf: add mac header for tunneled packets when sniffer is attached")
      an Ethernet header was cooked for traffic originating from tunnel devices.
      
      However, the header is added based on whether the mac_header is unset
      and ignores cases where the device doesn't expose a mac header to upper
      layers, such as in ip tunnels like ipip and gre.
      
      Traffic originating from such devices still appears garbled when capturing
      on the vrf device.
      
      Fix by observing whether the original device exposes a header to upper
      layers, similar to the logic done in af_packet.
      
      In addition, skb->mac_len needs to be adjusted after adding the Ethernet
      header for the skb_push/pull() surrounding dev_queue_xmit_nit() to work
      on these packets.
      
      Fixes: 04893908 ("vrf: add mac header for tunneled packets when sniffer is attached")
      Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net/tls: fix slab-out-of-bounds bug in decrypt_internal · 6e2f1b03
      Ziyang Xuan authored
      [ Upstream commit 9381fe8c ]
      
      The memory size of tls_ctx->rx.iv for AES128-CCM is set to 12 bytes in
      tls_set_sw_offload(), while crypto_aead_ivsize() returns 16 for
      "ccm(aes)". A memcpy() of 16 bytes from the 12-byte buffer therefore
      triggers the following slab-out-of-bounds bug:
      
      ==================================================================
      BUG: KASAN: slab-out-of-bounds in decrypt_internal+0x385/0xc40 [tls]
      Read of size 16 at addr ffff888114e84e60 by task tls/10911
      
      Call Trace:
       <TASK>
       dump_stack_lvl+0x34/0x44
       print_report.cold+0x5e/0x5db
       ? decrypt_internal+0x385/0xc40 [tls]
       kasan_report+0xab/0x120
       ? decrypt_internal+0x385/0xc40 [tls]
       kasan_check_range+0xf9/0x1e0
       memcpy+0x20/0x60
       decrypt_internal+0x385/0xc40 [tls]
       ? tls_get_rec+0x2e0/0x2e0 [tls]
       ? process_rx_list+0x1a5/0x420 [tls]
       ? tls_setup_from_iter.constprop.0+0x2e0/0x2e0 [tls]
       decrypt_skb_update+0x9d/0x400 [tls]
       tls_sw_recvmsg+0x3c8/0xb50 [tls]
      
      Allocated by task 10911:
       kasan_save_stack+0x1e/0x40
       __kasan_kmalloc+0x81/0xa0
       tls_set_sw_offload+0x2eb/0xa20 [tls]
       tls_setsockopt+0x68c/0x700 [tls]
       __sys_setsockopt+0xfe/0x1b0
      
      Replace crypto_aead_ivsize() with prot->iv_size + prot->salt_size
      as the memcpy() length for the IV in the TLS_1_3_VERSION case.
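
The length mismatch described above can be modelled in userspace. This is a minimal sketch, not the kernel code: the `sketch_*` names are hypothetical, and the 8-byte IV / 4-byte salt split is an assumption chosen to add up to the 12-byte allocation mentioned in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative sizes for AES128-CCM; assumptions made for this sketch. */
enum {
	SKETCH_CCM_IV_SIZE   = 8,	/* per-record IV bytes kept in the context */
	SKETCH_CCM_SALT_SIZE = 4,	/* salt bytes stored in front of the IV */
	SKETCH_AEAD_IVSIZE   = 16	/* value reported by the cipher for "ccm(aes)" */
};

/* Hypothetical stand-in for the two fields of the protocol info used here. */
struct sketch_prot_info {
	size_t iv_size;
	size_t salt_size;
};

/* Number of bytes that may safely be read from the stored rx IV. */
static size_t sketch_iv_copy_len(const struct sketch_prot_info *prot)
{
	return prot->iv_size + prot->salt_size;
}

/*
 * Copy the stored IV into a work buffer sized for the cipher.
 * Returns the number of bytes copied, or 0 if dst is too small.
 */
static size_t sketch_copy_iv(unsigned char *dst, size_t dst_len,
			     const unsigned char *stored_iv,
			     const struct sketch_prot_info *prot)
{
	size_t len = sketch_iv_copy_len(prot);

	if (len > dst_len)
		return 0;
	memcpy(dst, stored_iv, len);	/* reads 12 bytes, not the cipher's 16 */
	return len;
}
```

Bounding the copy by the size that was actually allocated, rather than by the size the cipher reports, keeps the read inside the 12-byte buffer regardless of how large the work buffer is.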
      
      Fixes: f295b3ae ("net/tls: Add support of AES128-CCM based ciphers")
      Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • net: sfc: add missing xdp queue reinitialization · ed7a824f
      Taehee Yoo authored
      [ Upstream commit 059a47f1 ]
      
      After the rx/tx ring buffer size is changed, a kernel panic occurs when
      the driver performs XDP_TX or XDP_REDIRECT.

      When the tx/rx ring buffer size is changed (ethtool -G), the sfc driver
      reallocates and reinitializes the rx and tx queues and their buffers
      (tx_queue->buffer), but it does not reinitialize the xdp queues
      (efx->xdp_tx_queues). As a result, XDP_TX and XDP_REDIRECT use an
      uninitialized tx_queue->buffer.

      A new function, efx_set_xdp_channels(), is split out of efx_set_channels()
      to handle only the xdp queues.
      
      Splat looks like:
         BUG: kernel NULL pointer dereference, address: 000000000000002a
         #PF: supervisor write access in kernel mode
         #PF: error_code(0x0002) - not-present page
         PGD 0 P4D 0
         Oops: 0002 [#4] PREEMPT SMP NOPTI
         RIP: 0010:efx_tx_map_chunk+0x54/0x90 [sfc]
         CPU: 2 PID: 0 Comm: swapper/2 Tainted: G      D           5.17.0+ #55 e8beeee8289528f11357029357cf
         Code: 48 8b 8d a8 01 00 00 48 8d 14 52 4c 8d 2c d0 44 89 e0 48 85 c9 74 0e 44 89 e2 4c 89 f6 48 80
         RSP: 0018:ffff92f121e45c60 EFLAGS: 00010297
         RIP: 0010:efx_tx_map_chunk+0x54/0x90 [sfc]
         RAX: 0000000000000040 RBX: ffff92ea506895c0 RCX: ffffffffc0330870
         RDX: 0000000000000001 RSI: 00000001139b10ce RDI: ffff92ea506895c0
         RBP: ffffffffc0358a80 R08: 00000001139b110d R09: 0000000000000000
         R10: 0000000000000001 R11: ffff92ea414c0088 R12: 0000000000000040
         R13: 0000000000000018 R14: 00000001139b10ce R15: ffff92ea506895c0
         FS:  0000000000000000(0000) GS:ffff92f121ec0000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         Code: 48 8b 8d a8 01 00 00 48 8d 14 52 4c 8d 2c d0 44 89 e0 48 85 c9 74 0e 44 89 e2 4c 89 f6 48 80
         CR2: 000000000000002a CR3: 00000003e6810004 CR4: 00000000007706e0
         RSP: 0018:ffff92f121e85c60 EFLAGS: 00010297
         PKRU: 55555554
         RAX: 0000000000000040 RBX: ffff92ea50689700 RCX: ffffffffc0330870
         RDX: 0000000000000001 RSI: 00000001145a90ce RDI: ffff92ea50689700
         RBP: ffffffffc0358a80 R08: 00000001145a910d R09: 0000000000000000
         R10: 0000000000000001 R11: ffff92ea414c0088 R12: 0000000000000040
         R13: 0000000000000018 R14: 00000001145a90ce R15: ffff92ea50689700
         FS:  0000000000000000(0000) GS:ffff92f121e80000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 000000000000002a CR3: 00000003e6810005 CR4: 00000000007706e0
         PKRU: 55555554
         Call Trace:
          <IRQ>
          efx_xdp_tx_buffers+0x12b/0x3d0 [sfc 84c94b8e32d44d296c17e10a634d3ad454de4ba5]
          __efx_rx_packet+0x5c3/0x930 [sfc 84c94b8e32d44d296c17e10a634d3ad454de4ba5]
          efx_rx_packet+0x28c/0x2e0 [sfc 84c94b8e32d44d296c17e10a634d3ad454de4ba5]
          efx_ef10_ev_process+0x5f8/0xf40 [sfc 84c94b8e32d44d296c17e10a634d3ad454de4ba5]
          ? enqueue_task_fair+0x95/0x550
          efx_poll+0xc4/0x360 [sfc 84c94b8e32d44d296c17e10a634d3ad454de4ba5]
      
      Fixes: 3990a8ff ("sfc: allocate channels for XDP tx queues")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • vdpa: mlx5: prevent cvq work from hogging CPU · 69ec350a
      Jason Wang authored
      [ Upstream commit 55ebf0d6 ]
      
      A userspace-triggerable infinite loop could happen in
      mlx5_cvq_kick_handler() if userspace keeps sending a huge number of
      cvq requests.

      Fix this by introducing a quota and re-queueing the work when the
      budget is exhausted (currently the implicit budget is one). While at
      it, use a per-device work struct to avoid on-demand memory allocation
      for the cvq.
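
The quota idea can be sketched outside the kernel as follows. This is an illustrative model under stated assumptions, not the driver code: `sketch_cvq`, `sketch_cvq_kick()` and `sketch_drain()` are hypothetical names, and "re-queueing" is modelled as simply calling the handler again.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a control virtqueue with pending requests. */
struct sketch_cvq {
	int pending;	/* requests waiting to be handled */
	int requeues;	/* how many times the work item was re-queued */
};

/*
 * Handle at most `budget` requests per invocation. Returns true when the
 * budget ran out with requests still pending, so the caller should
 * re-queue the work item instead of looping in place.
 */
static bool sketch_cvq_kick(struct sketch_cvq *cvq, int budget)
{
	if (budget < 1)
		budget = 1;	/* guarantee forward progress */

	while (cvq->pending > 0) {
		if (budget-- == 0)
			return true;
		cvq->pending--;
	}
	return false;
}

/* Model of the workqueue: run the handler until it stops asking to be re-queued. */
static int sketch_drain(struct sketch_cvq *cvq, int budget)
{
	int invocations = 1;

	while (sketch_cvq_kick(cvq, budget)) {
		cvq->requeues++;
		invocations++;
	}
	return invocations;
}
```

Each invocation returns after a bounded amount of work, so other work items get a chance to run between invocations even if requests keep arriving.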
      
      Fixes: 5262912e ("vdpa/mlx5: Add support for control VQ and MAC setting")
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/r/20220329042109.4029-1-jasowang@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Eli Cohen <elic@nvidia.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • vdpa/mlx5: Propagate link status from device to vdpa driver · 893c70f8
      Eli Cohen authored
      [ Upstream commit edf747af ]
      
      Add code to register to hardware asynchronous events. Use this
      mechanism to track link status events coming from the device and update
      the config struct.
      
      After doing link status change, call the vdpa callback to notify of the
      link status change.
      
      Signed-off-by: Eli Cohen <elic@nvidia.com>
      Link: https://lore.kernel.org/r/20210909123635.30884-4-elic@nvidia.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • vdpa/mlx5: Rename control VQ workqueue to vdpa wq · dc872b72
      Eli Cohen authored
      [ Upstream commit 218bdd20 ]
      
      A subsequent patch will use the same workqueue for executing other
      work not related to control VQ. Rename the workqueue and the work queue
      entry used to convey information to the workqueue.
      
      Signed-off-by: Eli Cohen <elic@nvidia.com>
      Link: https://lore.kernel.org/r/20210909123635.30884-3-elic@nvidia.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • scsi: zorro7xx: Fix a resource leak in zorro7xx_remove_one() · aefd755a
      Christophe JAILLET authored
      [ Upstream commit 16ed828b ]
      
      The error handling path of the probe releases a resource that is not freed
      in the remove function. In some cases, an ioremap() must be undone.
      
      Add the missing iounmap() call in the remove function.
      
      Link: https://lore.kernel.org/r/247066a3104d25f9a05de8b3270fc3c848763bcc.1647673264.git.christophe.jaillet@wanadoo.fr
      Fixes: 45804fbb ("[SCSI] 53c700: Amiga Zorro NCR53c710 SCSI")
      Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • scsi: core: Fix sbitmap depth in scsi_realloc_sdev_budget_map() · cd483e17
      John Garry authored
      [ Upstream commit eaba83b5 ]
      
      In commit edb854a3 ("scsi: core: Reallocate device's budget map on
      queue depth change"), the sbitmap for the device budget map may be
      reallocated after the slave device depth is configured.
      
      When the sbitmap is reallocated we use the result from
      scsi_device_max_queue_depth() for the sbitmap size, but don't resize to
      match the actual device queue depth.
      
      Fix by resizing the sbitmap after reallocating the budget sbitmap. We do
      this instead of init'ing the sbitmap to the device queue depth as the user
      may want to change the queue depth later via sysfs or other means.
      
      Link: https://lore.kernel.org/r/1647423870-143867-1-git-send-email-john.garry@huawei.com
      Fixes: edb854a3 ("scsi: core: Reallocate device's budget map on queue depth change")
      Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: John Garry <john.garry@huawei.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • scsi: sr: Fix typo in CDROM(CLOSETRAY|EJECT) handling · 0610371c
      Kevin Groeneveld authored
      [ Upstream commit bc5519c1 ]
      
      Commit 2e27f576 ("scsi: scsi_ioctl: Call scsi_cmd_ioctl() from
      scsi_ioctl()") seems to have a typo as it is checking ret instead of cmd in
      the if statement checking for CDROMCLOSETRAY and CDROMEJECT.  This changes
      the behaviour of these ioctls as the cdrom_ioctl handling of these is more
      restrictive than the scsi_ioctl version.
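
The effect of checking the wrong variable can be illustrated with a small model. This is a sketch under assumptions, not the driver source: the `sketch_*` functions are hypothetical, the predicate "should cdrom_ioctl() be tried first" is an interpretation of the description above, and the two ioctl numbers are repeated from <linux/cdrom.h> only so the example is self-contained.

```c
#include <assert.h>
#include <stdbool.h>

/* ioctl numbers as defined in <linux/cdrom.h>, repeated for self-containment. */
#define SKETCH_CDROMEJECT	0x5309
#define SKETCH_CDROMCLOSETRAY	0x5319

/*
 * Typo variant: compares the previous return value (typically 0) with the
 * ioctl numbers, so the test is true for practically every call and the
 * cdrom path is tried even for CDROMCLOSETRAY and CDROMEJECT.
 */
static bool sketch_try_cdrom_ioctl_typo(int ret, unsigned int cmd)
{
	(void)cmd;
	return ret != SKETCH_CDROMCLOSETRAY && ret != SKETCH_CDROMEJECT;
}

/* Intended variant: compares the ioctl number itself. */
static bool sketch_try_cdrom_ioctl_fixed(int ret, unsigned int cmd)
{
	(void)ret;
	return cmd != SKETCH_CDROMCLOSETRAY && cmd != SKETCH_CDROMEJECT;
}
```

With the intended check, the two tray commands bypass the more restrictive cdrom path, while every other command is still offered to it first.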
      
      Link: https://lore.kernel.org/r/20220323002242.21157-1-kgroeneveld@lenbrook.com
      Fixes: 2e27f576 ("scsi: scsi_ioctl: Call scsi_cmd_ioctl() from scsi_ioctl()")
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Kevin Groeneveld <kgroeneveld@lenbrook.com>
      Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • NFSv4: fix open failure with O_ACCMODE flag · 6f52d4cd
      ChenXiaoSong authored
      [ Upstream commit b243874f ]
      
      A second open() with the O_ACCMODE|O_DIRECT flags fails.
      
      Reproducer:
        1. mount -t nfs -o vers=4.2 $server_ip:/ /mnt/
        2. fd = open("/mnt/file", O_ACCMODE|O_DIRECT|O_CREAT)
        3. close(fd)
        4. fd = open("/mnt/file", O_ACCMODE|O_DIRECT)
      
      On the server, nfsd4_decode_share_access() fails with nfserr_bad_xdr when
      the client uses an incorrect share access mode of 0.

      Fix this by using the NFS4_SHARE_ACCESS_BOTH share access mode in the
      client, just as on the first open.
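
The mapping from open flags to a share access mode can be sketched as below. This is a hedged illustration, not the NFS client code: `sketch_share_access()` is a hypothetical helper, and the three constants mirror the NFSv4 share access values (1, 2, 3) so that the example compiles on its own.

```c
#include <assert.h>
#include <fcntl.h>

/* NFSv4 share access values, repeated here for self-containment. */
#define SKETCH_SHARE_ACCESS_READ	1u
#define SKETCH_SHARE_ACCESS_WRITE	2u
#define SKETCH_SHARE_ACCESS_BOTH	3u

/*
 * Map the access-mode bits of the open flags to a share access mode.
 * The special value O_ACCMODE (both low bits set) is treated like a
 * read/write open, so the result is never 0.
 */
static unsigned int sketch_share_access(int open_flags)
{
	switch (open_flags & O_ACCMODE) {
	case O_RDONLY:
		return SKETCH_SHARE_ACCESS_READ;
	case O_WRONLY:
		return SKETCH_SHARE_ACCESS_WRITE;
	case O_RDWR:
		return SKETCH_SHARE_ACCESS_BOTH;
	default:	/* O_ACCMODE itself */
		return SKETCH_SHARE_ACCESS_BOTH;
	}
}
```

Because every branch returns a non-zero value, a server that rejects a share access of 0 never sees one, whichever access-mode bits the caller passed.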
      
      Fixes: ce4ef7c0 ("NFS: Split out NFS v4 file operations")
      Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Revert "NFSv4: Handle the special Linux file open access mode" · 9f0c2174
      ChenXiaoSong authored
      [ Upstream commit ab0fc21b ]
      
      This reverts commit 44942b4e.
      
      After a file is opened a second time with the O_ACCMODE|O_DIRECT flags,
      nfs4_valid_open_stateid() dereferences a NULL nfs4_state on lseek().
      
      Reproducer:
        1. mount -t nfs -o vers=4.2 $server_ip:/ /mnt/
        2. fd = open("/mnt/file", O_ACCMODE|O_DIRECT|O_CREAT)
        3. close(fd)
        4. fd = open("/mnt/file", O_ACCMODE|O_DIRECT)
        5. lseek(fd)
      
      Reported-by: Lyu Tao <tao.lyu@epfl.ch>
      Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • Drivers: hv: vmbus: Fix potential crash on module unload · dcd6b1a6
      Guilherme G. Piccoli authored
      [ Upstream commit 792f232d ]
      
      The vmbus driver relies on the panic notifier infrastructure to perform
      some operations when a panic event is detected. Since vmbus can be built
      as module, it is required that the driver handles both registering and
      unregistering such panic notifier callback.
      
      After commit 74347a99 ("x86/Hyper-V: Unload vmbus channel in hv panic callback")
      though, the panic notifier registration is done unconditionally in the module
      initialization routine whereas the unregistering procedure is conditionally
      guarded and executes only if HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE capability
      is set.
      
      This patch fixes that by unconditionally unregistering the panic notifier
      in the module's exit routine as well.
      
      Fixes: 74347a99 ("x86/Hyper-V: Unload vmbus channel in hv panic callback")
      Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
      Reviewed-by: Michael Kelley <mikelley@microsoft.com>
      Link: https://lore.kernel.org/r/20220315203535.682306-1-gpiccoli@igalia.com
      Signed-off-by: Wei Liu <wei.liu@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • drm/amdgpu: fix off by one in amdgpu_gfx_kiq_acquire() · 5ba9d78a
      Dan Carpenter authored
      [ Upstream commit 1647b54e ]
      
      This post-op should be a pre-op so that we do not pass -1 as the bit
      number to test_bit().  The current code will loop downwards from 63 to
      -1.  After changing to a pre-op, it loops from 63 to 0.
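
The difference between the two loop forms can be checked with a small standalone model. This is a sketch with hypothetical names; it only records the lowest index that the loop body observes and does not touch any bitmap.

```c
#include <assert.h>

/* Lowest index seen by the loop body when the test uses a post-decrement. */
static int sketch_lowest_index_post(int start)
{
	int bit = start, lowest = start;

	while (bit-- >= 0)
		lowest = bit;	/* body sees start-1 down to -1 */
	return lowest;
}

/* Lowest index seen by the loop body when the test uses a pre-decrement. */
static int sketch_lowest_index_pre(int start)
{
	int bit = start, lowest = start;

	while (--bit >= 0)
		lowest = bit;	/* body sees start-1 down to 0 */
	return lowest;
}
```

Starting from 64, the post-decrement form lets the body run once with -1, which is the value that must not reach test_bit(); the pre-decrement form stops at 0.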
      
      Fixes: 71c37505 ("drm/amdgpu/gfx: move more common KIQ code to amdgpu_gfx.c")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • rtc: mc146818-lib: fix RTC presence check · 985d87e6
      Mateusz Jończyk authored
      [ Upstream commit ea6fa496 ]
      
      To prevent an infinite loop in mc146818_get_time(),
      commit 211e5db1 ("rtc: mc146818: Detect and handle broken RTCs")
      added a check for RTC availability. Together with a later fix, it
      checked if bit 6 in register 0x0d is cleared.
      
      This, however, caused a false negative on a motherboard with an AMD
      SB710 southbridge; according to the specification [1], bit 6 of register
      0x0d of this chipset is a scratchbit. This caused a regression in Linux
      5.11 - the RTC was determined broken by the kernel and not used by
      rtc-cmos.c [3]. This problem was also reported in Fedora [4].
      
      As a better alternative, check whether the UIP ("Update-in-progress")
      bit is set for longer than 10ms. If that is the case, then apparently
      the RTC is either absent (and all register reads return 0xff) or broken.
      Also limit the number of loop iterations in mc146818_get_time() to 10 to
      prevent an infinite loop there.
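
The bounded UIP poll can be modelled in userspace as follows. This is a minimal sketch under assumptions, not the library code: the register read is abstracted behind a hypothetical callback, the UIP bit is taken to be bit 7 (0x80) of register A, and the roughly 1 ms wait between polls is only indicated by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_RTC_UIP 0x80u	/* update-in-progress bit, assumed to be bit 7 */

/* Hypothetical callback that returns the current value of register A. */
typedef uint8_t (*sketch_read_reg_a_fn)(void *ctx);

/*
 * Poll the UIP bit at most max_polls times. Returns true as soon as the
 * bit is seen clear, false if it stayed set for the whole window (RTC
 * absent or broken).
 */
static bool sketch_rtc_responds(sketch_read_reg_a_fn read_reg_a, void *ctx,
				int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if ((read_reg_a(ctx) & SKETCH_RTC_UIP) == 0)
			return true;
		/* a real implementation would wait about 1 ms here */
	}
	return false;
}

/* Test double: an absent RTC, every read returns 0xff. */
static uint8_t sketch_absent_rtc(void *ctx)
{
	(void)ctx;
	return 0xff;
}

/* Test double: UIP is set for *ctx reads, then clears. */
static uint8_t sketch_updating_rtc(void *ctx)
{
	int *remaining = (int *)ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return SKETCH_RTC_UIP;
	}
	return 0x26;
}
```

An absent RTC reads back 0xff, which has the UIP bit set on every poll, so the bounded loop reports failure instead of spinning forever.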
      
      The functions mc146818_get_time() and mc146818_does_rtc_work() will be
      refactored later in this patch series, in order to fix a separate
      problem with reading / setting the RTC alarm time. This is done to
      avoid confusion about what is being fixed when.
      
      In a previous approach to this problem, I implemented a check whether
      the RTC_HOURS register contains a value <= 24. This, however, sometimes
      did not work correctly on my Intel Kaby Lake laptop. According to
      Intel's documentation [2], "the time and date RAM locations (0-9) are
      disconnected from the external bus" during the update cycle so reading
      this register without checking the UIP bit is incorrect.
      
      [1] AMD SB700/710/750 Register Reference Guide, page 308,
      https://developer.amd.com/wordpress/media/2012/10/43009_sb7xx_rrg_pub_1.00.pdf
      
      [2] 7th Generation Intel ® Processor Family I/O for U/Y Platforms [...] Datasheet
      Volume 1 of 2, page 209
      Intel's Document Number: 334658-006,
      https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/7th-and-8th-gen-core-family-mobile-u-y-processor-lines-i-o-datasheet-vol-1.pdf
      
      [3] Functions in arch/x86/kernel/rtc.c apparently were using it.
      
      [4] https://bugzilla.redhat.com/show_bug.cgi?id=1936688
      
      Fixes: 211e5db1 ("rtc: mc146818: Detect and handle broken RTCs")
      Fixes: ebb22a05 ("rtc: mc146818: Dont test for bit 0-5 in Register D")
      Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Link: https://lore.kernel.org/r/20211210200131.153887-5-mat.jonczyk@o2.pl
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • rtc: Check return value from mc146818_get_time() · be6c3152
      Mateusz Jończyk authored
      [ Upstream commit 0dd8d6cb ]
      
      There are 4 users of mc146818_get_time() and none of them was checking
      the return value from this function. Change this.
      
      Print the appropriate warnings in callers of mc146818_get_time() instead
      of in the function mc146818_get_time() itself, in order not to add
      strings to rtc-mc146818-lib.c, which is kind of a library.
      
      The callers of alpha_rtc_read_time() and cmos_read_time() may use the
      contents of (struct rtc_time *) even when the functions return a failure
      code. Therefore, set the contents of (struct rtc_time *) to 0x00,
      which looks more sensible than 0xff and aligns with the (possibly
      stale?) comment in cmos_read_time:
      
      	/*
      	 * If pm_trace abused the RTC for storage, set the timespec to 0,
      	 * which tells the caller that this RTC value is unusable.
      	 */
      
      For consistency, do this in mc146818_get_time().
      
      Note: hpet_rtc_interrupt() may call mc146818_get_time() many times a
      second. It is very unlikely, though, that the RTC suddenly stops
      working and mc146818_get_time() would consistently fail.
      
      Only compile-tested on alpha.
      
      Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Cc: linux-alpha@vger.kernel.org
      Cc: x86@kernel.org
      Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Link: https://lore.kernel.org/r/20211210200131.153887-4-mat.jonczyk@o2.pl
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • rtc: mc146818-lib: change return values of mc146818_get_time() · 8c692107
      Mateusz Jończyk authored
      [ Upstream commit d35786b3 ]
      
      No function is checking mc146818_get_time() return values yet, so
      correct them to make them more customary.
      
      Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl>
      Cc: Alessandro Zummo <a.zummo@towertech.it>
      Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
      Link: https://lore.kernel.org/r/20211210200131.153887-3-mat.jonczyk@o2.pl
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm: fix race between MADV_FREE reclaim and blkdev direct IO read · c9f50e06
      Mauricio Faria de Oliveira authored
      commit 6c8e2a25 upstream.
      
      Problem:
      =======
      
      Userspace might read the zero-page instead of actual data from a direct IO
      read on a block device if madvise(MADV_FREE) was called on the buffers
      earlier (this is discussed below), due to a race between page reclaim on
      MADV_FREE and blkdev direct IO read.
      
      - Race condition:
        ==============
      
      During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks
      if the page is not dirty, then discards its rmap PTE(s) (vs.  remap back
      if the page is dirty).
      
      However, after try_to_unmap_one() returns to shrink_page_list(), it might
      keep the page _anyway_ if page_ref_freeze() fails (it expects exactly
      _one_ page reference, from the isolation for page reclaim).
      
      Well, blkdev_direct_IO() gets references for all pages, and on READ
      operations it only sets them dirty _later_.
      
      So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct
      IO read from block devices, and page reclaim happens during
      __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
      returns, but BEFORE the pages are set dirty, the situation happens.
      
      The direct IO read eventually completes.  Now, when userspace reads the
      buffers, the PTE is no longer there and the page fault handler
      do_anonymous_page() services that with the zero-page, NOT the data!
      
      A synthetic reproducer is provided.
      
      - Page faults:
        ===========
      
      If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't
      happen, because that faults-in all pages as writeable, so
      do_anonymous_page() sets up a new page/rmap/PTE, and that is used by
      direct IO.  The userspace reads don't fault as the PTE is there (thus
      zero-page is not used/setup).
      
      But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE
      is no longer there; the subsequent page faults can't help:
      
      The data-read from the block device probably won't generate faults due to
      DMA (no MMU) but even in the case it wouldn't use DMA, that happens on
      different virtual addresses (not user-mapped addresses) because `struct
      bio_vec` stores `struct page` to figure addresses out (which are different
      from user-mapped addresses) for the read.
      
      Thus userspace reads (to user-mapped addresses) still fault, then
      do_anonymous_page() gets another `struct page` that would address/ map to
      other memory than the `struct page` used by `struct bio_vec` for the read.
      (The original `struct page` is not available, since it wasn't freed, as
      page_ref_freeze() failed due to more page refs.  And even if it were
      available, its data cannot be trusted anymore.)
      
      Solution:
      ========
      
      One solution is to check for the expected page reference count in
      try_to_unmap_one().
      
      There should be one reference from the isolation (that is also checked in
      shrink_page_list() with page_ref_freeze()) plus one or more references
      from page mapping(s) (put in discard: label).  Further references mean
      that rmap/PTE cannot be unmapped/nuked.
      
      (Note: there might be more than one reference from mapping due to
      fork()/clone() without CLONE_VM, which use the same `struct page` for
      references, until the copy-on-write page gets copied.)
      
      So, additional page references (e.g., from direct IO read) now prevent the
      rmap/PTE from being unmapped/dropped, similarly to how the page is not
      freed per shrink_page_list()/page_ref_freeze().
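
The reference count rule described in this section can be written as a small predicate. This is a sketch of the reasoning only, with a hypothetical function name and plain integers in place of the kernel's page accessors; it ignores locking and memory barriers, which are covered separately.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * A clean lazyfree page may be discarded only when its references are
 * fully accounted for: one from the isolation for page reclaim plus one
 * per mapping. Any extra reference (for example from a direct IO read in
 * flight) or a dirty page means the mapping must be kept.
 */
static bool sketch_may_discard_lazyfree(int ref_count, int map_count, bool dirty)
{
	return !dirty && ref_count == 1 + map_count;
}
```

For a page mapped once, a count of 2 allows the discard, while a count of 3 (one extra reference held for the read) keeps the mapping in place so that the data written by the read stays visible to userspace.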
      
      - Races and Barriers:
        ==================
      
      The new check in try_to_unmap_one() should be safe in races with
      bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's
      done under the PTE lock.
      
      The fast path doesn't take the lock, but it checks if the PTE has changed
      and if so, it drops the reference and leaves the page for the slow path
      (which does take that lock).
      
      The fast path requires synchronization w/ full memory barrier: it writes
      the page reference count first then it reads the PTE later, while
      try_to_unmap() writes PTE first then it reads page refcount.
      
      And a second barrier is needed, as the page dirty flag should not be read
      before the page reference count (as in __remove_mapping()).  (This can be
      a load memory barrier only; no writes are involved.)
      
      Call stack/comments:
      
      - try_to_unmap_one()
        - page_vma_mapped_walk()
          - map_pte()			# see pte_offset_map_lock():
              pte_offset_map()
              spin_lock()
      
        - ptep_get_and_clear()	# write PTE
        - smp_mb()			# (new barrier) GUP fast path
        - page_ref_count()		# (new check) read refcount
      
        - page_vma_mapped_walk_done()	# see pte_unmap_unlock():
            pte_unmap()
            spin_unlock()
      
      - bio_iov_iter_get_pages()
        - __bio_iov_iter_get_pages()
          - iov_iter_get_pages()
            - get_user_pages_fast()
              - internal_get_user_pages_fast()
      
                # fast path
                - lockless_pages_from_mm()
                  - gup_{pgd,p4d,pud,pmd,pte}_range()
                      ptep = pte_offset_map()		# not _lock()
                      pte = ptep_get_lockless(ptep)
      
                      page = pte_page(pte)
                      try_grab_compound_head(page)	# inc refcount
                                                  	# (RMW/barrier
                                                   	#  on success)
      
                      if (pte_val(pte) != pte_val(*ptep)) # read PTE
                              put_compound_head(page) # dec refcount
                              			# go slow path
      
                # slow path
                - __gup_longterm_unlocked()
                  - get_user_pages_unlocked()
                    - __get_user_pages_locked()
                      - __get_user_pages()
                        - follow_{page,p4d,pud,pmd}_mask()
                          - follow_page_pte()
                              ptep = pte_offset_map_lock()
                              pte = *ptep
                              page = vm_normal_page(pte)
                              try_grab_page(page)	# inc refcount
                              pte_unmap_unlock()
      
      - Huge Pages:
        ==========
      
      Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE
      (aka lazyfree) pages are PageAnon() && !PageSwapBacked()
      (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn())
      thus should reach shrink_page_list() -> split_huge_page_to_list() before
      try_to_unmap[_one](), so it deals with normal pages only.
      
      (And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens,
      which should not or be rare, the page refcount should be greater than
      mapcount: the head page is referenced by tail pages.  That also prevents
      checking the head `page` then incorrectly call page_remove_rmap(subpage)
      for a tail page, that isn't even in the shrink_page_list()'s page_list (an
      effect of split huge pmd/pmvw), as it might happen today in this unlikely
      scenario.)
      
      MADV_FREE'd buffers:
      ===================
      
      So, back to the "if MADV_FREE pages are used as buffers" note.  The case
      is arguable, and subject to multiple interpretations.
      
      The madvise(2) manual page on the MADV_FREE advice value says:
      
      1) 'After a successful MADV_FREE ... data will be lost when
         the kernel frees the pages.'
      2) 'the free operation will be canceled if the caller writes
         into the page' / 'subsequent writes ... will succeed and
         then [the] kernel cannot free those dirtied pages'
      3) 'If there is no subsequent write, the kernel can free the
         pages at any time.'
      
      Thoughts, questions, considerations... respectively:
      
      1) Since the kernel didn't actually free the page (page_ref_freeze()
         failed), should the data not have been lost? (on userspace read.)
      2) Should writes performed by the direct IO read be able to cancel
         the free operation?
         - Should the direct IO read be considered as 'the caller' too,
           as it's been requested by 'the caller'?
         - Should the bio technique to dirty pages on return to userspace
           (bio_check_pages_dirty() is called/used by __blkdev_direct_IO())
           be considered in another/special way here?
      3) Should an upcoming write from a previously requested direct IO
         read be considered as a subsequent write, so the kernel should
         not free the pages? (as it's known at the time of page reclaim.)
      
      And lastly:
      
      Technically, the last point would seem a reasonable consideration and
      balance, as the madvise(2) manual page apparently (and fairly) seems to
      assume that 'writes' are memory accesses from the userspace process (not
      explicitly considering writes from the kernel or its corner cases; again,
      fairly).  Plus, the kernel fix for the corner case of the largely
      'non-atomic write' encompassed by a direct IO read operation is
      relatively simple; and it helps.
      
      Reproducer:
      ==========
      
      @ test.c (simplified, but works)
      
      	#define _GNU_SOURCE
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      
      	int main() {
      		int fd, i;
      		char *buf;
      
      		fd = open(DEV, O_RDONLY | O_DIRECT);
      
      		buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                      	   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      
      		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
      			buf[i] = 1; // init to non-zero
      
      		madvise(buf, BUF_SIZE, MADV_FREE);
      
      		read(fd, buf, BUF_SIZE);
      
      		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
      			printf("%p: 0x%x\n", &buf[i], buf[i]);
      
      		return 0;
      	}
      
      @ block/fops.c (formerly fs/block_dev.c)
      
      	+#include <linux/swap.h>
      	...
      	... __blkdev_direct_IO[_simple](...)
      	{
      	...
      	+	if (!strcmp(current->comm, "good"))
      	+		shrink_all_memory(ULONG_MAX);
      	+
               	ret = bio_iov_iter_get_pages(...);
      	+
      	+	if (!strcmp(current->comm, "bad"))
      	+		shrink_all_memory(ULONG_MAX);
      	...
      	}
      
      @ shell
      
              # NUM_PAGES=4
              # PAGE_SIZE=$(getconf PAGE_SIZE)
      
              # yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES}
              # DEV=$(losetup -f --show test.img)
      
              # gcc -DDEV=\"$DEV\" \
                    -DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \
                    -DPAGE_SIZE=${PAGE_SIZE} \
                     test.c -o test
      
              # od -tx1 $DEV
              0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a
              *
              0040000
      
              # mv test good
              # ./good
              0x7f7c10418000: 0x79
              0x7f7c10419000: 0x79
              0x7f7c1041a000: 0x79
              0x7f7c1041b000: 0x79
      
              # mv good bad
              # ./bad
              0x7fa1b8050000: 0x0
              0x7fa1b8051000: 0x0
              0x7fa1b8052000: 0x0
              0x7fa1b8053000: 0x0
      
      Note: the issue is consistent on v5.17-rc3, but it's intermittent with the
      support of MADV_FREE on v4.5 (60%-70% error; needs swap).  [wrap
      do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c].
      
      - v5.17-rc3:
      
              # for i in {1..1000}; do ./good; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
              # mv good bad
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x0
      
              # free | grep Swap
              Swap:             0           0           0
      
      - v4.5:
      
              # for i in {1..1000}; do ./good; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
              # mv good bad
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 2702  0x0
                 1298  0x79
      
              # swapoff -av
              swapoff /swap
      
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
      Ceph/TCMalloc:
      =============
      
      For documentation purposes, the use case driving the analysis/fix is Ceph
      on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to
      release unused memory to the system from the mmap'ed page heap (might be
      committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan()
      -> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() ->
      TCMalloc_SystemCommit() -> do nothing.
      
      Note: TCMalloc switched back to MADV_DONTNEED a few commits after the
      release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue
      just 'disappeared' on Ceph on later Ubuntu releases but is still present
      in the kernel, and can be hit by other use cases.
      
      The observed issue seems to be the old Ceph bug #22464 [1], where checksum
      mismatches are observed (and instrumentation with buffer dumps shows
      zero-pages read from mmap'ed/MADV_FREE'd page ranges).
      
      The issue in Ceph was reasonably deemed a kernel bug (comment #50) and
      mostly worked around with a retry mechanism, but other parts of Ceph could
      still hit that (rocksdb).  Anyway, it's less likely to be hit again as
      TCMalloc switched out of MADV_FREE by default.
      
      (Some kernel versions/reports from the Ceph bug, and relation with
      the MADV_FREE introduction/changes; TCMalloc versions not checked.)
      - 4.4 good
      - 4.5 (madv_free: introduction)
      - 4.9 bad
      - 4.10 good? maybe a swapless system
      - 4.12 (madv_free: no longer free instantly on swapless systems)
      - 4.13 bad
      
      [1] https://tracker.ceph.com/issues/22464
      
      Thanks:
      ======
      
      Several people contributed to analysis/discussions/tests/reproducers in
      the first stages when drilling down on ceph/tcmalloc/linux kernel:
      
      - Dan Hill
      - Dan Streetman
      - Dongdong Tao
      - Gavin Guo
      - Gerald Yang
      - Heitor Alves de Siqueira
      - Ioanna Alifieraki
      - Jay Vosburgh
      - Matthew Ruffell
      - Ponnuvel Palaniyappan
      
      Reviews, suggestions, corrections, comments:
      
      - Minchan Kim
      - Yu Zhao
      - Huang, Ying
      - John Hubbard
      - Christoph Hellwig
      
      [mfo@canonical.com: v4]
        Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.com
      Link: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com
      
      Fixes: 802a3a92 ("mm: reclaim MADV_FREE pages")
      Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Dan Hill <daniel.hill@canonical.com>
      Cc: Dan Streetman <dan.streetman@canonical.com>
      Cc: Dongdong Tao <dongdong.tao@canonical.com>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Gerald Yang <gerald.yang@canonical.com>
      Cc: Heitor Alves de Siqueira <halves@canonical.com>
      Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Matthew Ruffell <matthew.ruffell@canonical.com>
      Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com>
      Cc: <stable@vger.kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [mfo: backport: replace folio/test_flag with page/flag equivalents;
       real Fixes: 854e9ed0 ("mm: support madvise(MADV_FREE)") in v4.]
      Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c9f50e06
    • John David Anglin's avatar
      parisc: Fix patch code locking and flushing · 93a8347f
      John David Anglin authored
      [ Upstream commit a9fe7fa7 ]
      
      This change fixes the following:
      
      1) The flags variable is not initialized. Always use raw_spin_lock_irqsave
      and raw_spin_unlock_irqrestore to serialize patching.
      
      2) flush_kernel_vmap_range is primarily intended for DMA flushes. Since
      __patch_text_multiple is often called with interrupts disabled, it is
      better to directly call flush_kernel_dcache_range_asm and
      flush_kernel_icache_range_asm. This avoids an extra call.
      
      3) The final call to flush_icache_range is unnecessary.
      
      Signed-off-by: John David Anglin <dave.anglin@bell.net>
      Signed-off-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      93a8347f
    • Helge Deller's avatar
      parisc: Fix CPU affinity for Lasi, WAX and Dino chips · f77f482e
      Helge Deller authored
      [ Upstream commit 939fc856 ]
      
      Add the missing logic to allow Lasi, WAX and Dino to set the
      CPU affinity. This fixes IRQ migration to other CPUs when a
      CPU is shutdown which currently holds the IRQs for one of those
      chips.
      
      Signed-off-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f77f482e
    • Naresh Kamboju's avatar
      selftests: net: Add tls config dependency for tls selftests · 30dd4af4
      Naresh Kamboju authored
      [ Upstream commit d9142e1c ]
      
      The selftest net tls test cases need TLS=m; without it the test hangs.
      Enabling config TLS solves this problem and lets the test run to completion.
        - CONFIG_TLS=m
      
      Reported-by: Linux Kernel Functional Testing <lkft@linaro.org>
      Signed-off-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      30dd4af4
    • Trond Myklebust's avatar
      NFS: Avoid writeback threads getting stuck in mempool_alloc() · ea029e4c
      Trond Myklebust authored
      [ Upstream commit 0bae835b ]
      
      In a low memory situation, allow the NFS writeback code to fail without
      getting stuck in infinite loops in mempool_alloc().
      
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ea029e4c
    • Trond Myklebust's avatar
      NFS: nfsiod should not block forever in mempool_alloc() · da747de6
      Trond Myklebust authored
      [ Upstream commit 515dcdcd ]
      
      The concern is that since nfsiod is sometimes required to kick off a
      commit, it can get locked up waiting forever in mempool_alloc() instead
      of failing gracefully and leaving the commit until later.
      
      Try to allocate from the slab first, with GFP_KERNEL | __GFP_NORETRY,
      then fall back to a non-blocking attempt to allocate from the memory
      pool.
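
      The two-step allocation described above can be sketched in userspace C.
      This is only an illustration under stated assumptions: `reserve_pool`,
      `reserve_try_alloc()`, `alloc_with_fallback()` and `failing_alloc()` are
      hypothetical names that stand in for the slab and the memory pool; they
      are not the kernel API and not the code of this patch.

```c
#include <stddef.h>
#include <stdlib.h>

#define RESERVE_SLOTS 2
#define OBJ_SIZE 64

/* Hypothetical fixed reserve, standing in for a memory pool. */
struct reserve_pool {
	unsigned char slots[RESERVE_SLOTS][OBJ_SIZE];
	int used[RESERVE_SLOTS];
};

/* Non-blocking: hand out a free reserve slot or return NULL at once. */
void *reserve_try_alloc(struct reserve_pool *pool)
{
	for (int i = 0; i < RESERVE_SLOTS; i++) {
		if (!pool->used[i]) {
			pool->used[i] = 1;
			return pool->slots[i];
		}
	}
	return NULL;
}

/*
 * Try the regular allocator first; if it fails, fall back to the
 * reserve without waiting.  A NULL return lets the caller fail
 * gracefully and retry later instead of blocking forever.
 * 'primary' is injectable so that failure can be simulated.
 */
void *alloc_with_fallback(void *(*primary)(size_t), struct reserve_pool *pool)
{
	void *p = primary(OBJ_SIZE);

	if (!p)
		p = reserve_try_alloc(pool);
	return p;
}

/* Simulated memory pressure: the regular allocator always fails. */
void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}
```

      With `failing_alloc` as the primary allocator, the first two calls are
      served from the reserve and the third returns NULL immediately, which is
      the "fail gracefully" outcome the message asks for.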
      
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      da747de6
    • Trond Myklebust's avatar
      SUNRPC: Fix socket waits for write buffer space · e04ef859
      Trond Myklebust authored
      [ Upstream commit 7496b59f ]
      
      The socket layer requires that we use the socket lock to protect changes
      to the sock->sk_write_pending field and others.
      
      Reported-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e04ef859
    • Haimin Zhang's avatar
      jfs: prevent NULL deref in diFree · d925b7e7
      Haimin Zhang authored
      [ Upstream commit a5304629 ]
      
      Add a validation check for JFS_IP(ipimap)->i_imap to prevent a NULL deref
      in diFree, since diFree uses it without doing any validation.
      When jfs_mount calls diMount to initialize the fileset inode allocation
      map, diMount can fail and leave JFS_IP(ipimap)->i_imap uninitialized.
      jfs_mount then calls diFreeSpecial to close the fileset inode allocation
      map inode, which flows into jfs_evict_inode. jfs_evict_inode only
      validates JFS_SBI(inode->i_sb)->ipimap before calling diFree. diFree uses
      JFS_IP(ipimap)->i_imap directly, which causes a NULL deref.
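
      The kind of guard described above can be sketched in userspace C. The
      types and functions below (`imap_sketch`, `map_inode_sketch`,
      `free_inode_sketch()`, `evict_inode_sketch()`) are hypothetical
      stand-ins, not the JFS structures or the code of this patch; the sketch
      assumes the check is made by the caller before the free routine runs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the allocation map and the inode holding it. */
struct imap_sketch {
	int nfree;
};

struct map_inode_sketch {
	struct imap_sketch *i_imap; /* stays NULL if mount failed early */
};

/* Like the free routine in the text: uses i_imap without validating it. */
void free_inode_sketch(struct map_inode_sketch *ipimap)
{
	ipimap->i_imap->nfree++;
}

/*
 * Validate both the map inode and the map itself before freeing.
 * Returns true when the free routine was actually called.
 */
bool evict_inode_sketch(struct map_inode_sketch *ipimap)
{
	if (ipimap && ipimap->i_imap) {
		free_inode_sketch(ipimap);
		return true;
	}
	return false;
}
```

      Checking only `ipimap` would still let the failed-mount case through,
      because the inode exists while its map pointer is NULL.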
      
      Reported-by: TCS Robot <tcs_robot@tencent.com>
      Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
      Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      d925b7e7
    • Randy Dunlap's avatar
      virtio_console: eliminate anonymous module_init & module_exit · 44c2d5fb
      Randy Dunlap authored
      [ Upstream commit fefb8a2a ]
      
      Eliminate anonymous module_init() and module_exit(), which can lead to
      confusion or ambiguity when reading System.map, crashes/oops/bugs,
      or an initcall_debug log.
      
      Give each of these init and exit functions unique driver-specific
      names to eliminate the anonymous names.
      
      Example 1: (System.map)
       ffffffff832fc78c t init
       ffffffff832fc79e t init
       ffffffff832fc8f8 t init
      
      Example 2: (initcall_debug log)
       calling  init+0x0/0x12 @ 1
       initcall init+0x0/0x12 returned 0 after 15 usecs
       calling  init+0x0/0x60 @ 1
       initcall init+0x0/0x60 returned 0 after 2 usecs
       calling  init+0x0/0x9a @ 1
       initcall init+0x0/0x9a returned 0 after 74 usecs
      
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Reviewed-by: Amit Shah <amit@kernel.org>
      Cc: virtualization@lists.linux-foundation.org
      Cc: Arnd Bergmann <arnd@arndb.de>
      Link: https://lore.kernel.org/r/20220316192010.19001-3-rdunlap@infradead.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      44c2d5fb
    • Jiri Slaby's avatar
      serial: samsung_tty: do not unlock port->lock for uart_write_wakeup() · 053bbff8
      Jiri Slaby authored
      [ Upstream commit 988c7c00 ]
      
      The commit c15c3747 (serial: samsung: fix potential soft lockup
      during uart write) added an unlock of port->lock before
      uart_write_wakeup() and a lock after it. It was always problematic to
      write data from tty_ldisc_ops::write_wakeup and it was even documented
      that way. We fixed the line disciplines to conform to this recently.
      So if there is still a missed one, we should fix it instead of keeping
      this workaround.

      On top of that, s3c24xx_serial_tx_dma_complete() in this driver
      still holds the port->lock while calling uart_write_wakeup().
      
      So revert the wrap added by the commit above.
      
      Cc: Thomas Abraham <thomas.abraham@linaro.org>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Hyeonkook Kim <hk619.kim@samsung.com>
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Link: https://lore.kernel.org/r/20220308115153.4225-1-jslaby@suse.cz
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      053bbff8
    • Nathan Chancellor's avatar
      x86/Kconfig: Do not allow CONFIG_X86_X32_ABI=y with llvm-objcopy · c393a9f4
      Nathan Chancellor authored
      [ Upstream commit aaeed6ec ]
      
      There are two outstanding issues with CONFIG_X86_X32_ABI and
      llvm-objcopy, with similar root causes:
      
      1. llvm-objcopy does not properly convert .note.gnu.property when going
         from x86_64 to x86_x32, resulting in a corrupted section when
         linking:
      
         https://github.com/ClangBuiltLinux/linux/issues/1141
      
      2. llvm-objcopy produces corrupted compressed debug sections when going
         from x86_64 to x86_x32, also resulting in an error when linking:
      
         https://github.com/ClangBuiltLinux/linux/issues/514
      
      After commit 41c5ef31ad71 ("x86/ibt: Base IBT bits"), the
      .note.gnu.property section is always generated when
      CONFIG_X86_KERNEL_IBT is enabled, which causes the first issue to become
      visible with an allmodconfig build:
      
        ld.lld: error: arch/x86/entry/vdso/vclock_gettime-x32.o:(.note.gnu.property+0x1c): program property is too short
      
      To avoid this error, do not allow CONFIG_X86_X32_ABI to be selected when
      using llvm-objcopy. If the two issues ever get fixed in llvm-objcopy,
      this can be turned into a feature check.
      
      Signed-off-by: Nathan Chancellor <nathan@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20220314194842.3452-3-nathan@kernel.org
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c393a9f4
    • Peter Zijlstra's avatar
      x86: Annotate call_on_stack() · e3c961c5
      Peter Zijlstra authored
      [ Upstream commit be007595 ]
      
      vmlinux.o: warning: objtool: page_fault_oops()+0x13c: unreachable instruction
      
      0000 000000000005b460 <page_fault_oops>:
      ...
      0128    5b588:  49 89 23                mov    %rsp,(%r11)
      012b    5b58b:  4c 89 dc                mov    %r11,%rsp
      012e    5b58e:  4c 89 f2                mov    %r14,%rdx
      0131    5b591:  48 89 ee                mov    %rbp,%rsi
      0134    5b594:  4c 89 e7                mov    %r12,%rdi
      0137    5b597:  e8 00 00 00 00          call   5b59c <page_fault_oops+0x13c>    5b598: R_X86_64_PLT32   handle_stack_overflow-0x4
      013c    5b59c:  5c                      pop    %rsp
      
      vmlinux.o: warning: objtool: sysvec_reboot()+0x6d: unreachable instruction
      
      0000 00000000000033f0 <sysvec_reboot>:
      ...
      005d     344d:  4c 89 dc                mov    %r11,%rsp
      0060     3450:  e8 00 00 00 00          call   3455 <sysvec_reboot+0x65>        3451: R_X86_64_PLT32    irq_enter_rcu-0x4
      0065     3455:  48 89 ef                mov    %rbp,%rdi
      0068     3458:  e8 00 00 00 00          call   345d <sysvec_reboot+0x6d>        3459: R_X86_64_PC32     .text+0x47d0c
      006d     345d:  e8 00 00 00 00          call   3462 <sysvec_reboot+0x72>        345e: R_X86_64_PLT32    irq_exit_rcu-0x4
      0072     3462:  5c                      pop    %rsp
      
      Both cases are due to a call_on_stack() calling a __noreturn function.
      Since that's an inline asm, GCC can't do anything about the
      instructions after the CALL. Therefore put in an explicit
      ASM_REACHABLE annotation to make sure objtool and gcc are consistently
      confused about control flow.
      
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Link: https://lore.kernel.org/r/20220308154319.468805622@infradead.org
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e3c961c5
    • NeilBrown's avatar
      NFS: swap-out must always use STABLE writes. · 6bb22702
      NeilBrown authored
      [ Upstream commit c265de25 ]
      
      The commit handling code is not safe against memory-pressure deadlocks
      when writing to swap.  In particular, nfs_commitdata_alloc() blocks
      indefinitely waiting for memory, and this can consume all available
      workqueue threads.
      
      swap-out most likely uses STABLE writes anyway as COND_STABLE indicates
      that a stable write should be used if the write fits in a single
      request, and it normally does.  However if we ever swap with a small
      wsize, or gather unusually large numbers of pages for a single write,
      this might change.
      
      For safety, make it explicit in the code that direct writes used for swap
      must always use FLUSH_STABLE.
      
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6bb22702
    • NeilBrown's avatar
      NFS: swap IO handling is slightly different for O_DIRECT IO · 24d28d9b
      NeilBrown authored
      [ Upstream commit 64158668 ]
      
      1/ Taking the i_rwsem for swap IO triggers lockdep warnings regarding
         possible deadlocks with "fs_reclaim".  These deadlocks could, I believe,
         eventuate if a buffered read on the swapfile was attempted.
      
         We don't need coherence with the page cache for a swap file, and
         buffered writes are forbidden anyway.  There is no other need for
         i_rwsem during direct IO.  So never take it for swap_rw()
      
      2/ generic_write_checks() explicitly forbids writes to swap, and
         performs checks that are not needed for swap.  So bypass it
         for swap_rw().
      
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      24d28d9b
    • NeilBrown's avatar
      SUNRPC: remove scheduling boost for "SWAPPER" tasks. · a5538640
      NeilBrown authored
      [ Upstream commit a80a8461 ]
      
      Currently, tasks marked as "swapper" tasks get put to the front of
      non-priority rpc_queues, and are sorted earlier than non-swapper tasks on
      the transport's ->xmit_queue.
      
      This is pointless as currently *all* tasks for a mount that has swap
      enabled on *any* file are marked as "swapper" tasks.  So the net result
      is that the non-priority rpc_queues are reverse-ordered (LIFO).
      
      This scheduling boost is not necessary to avoid deadlocks, and hurts
      fairness, so remove it.  If there were a need to expedite some requests,
      the tk_priority mechanism is a more appropriate tool.
      
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a5538640
    • NeilBrown's avatar
      SUNRPC/xprt: async tasks mustn't block waiting for memory · 20700aa0
      NeilBrown authored
      [ Upstream commit a7210354 ]
      
      When memory is short, new worker threads cannot be created and we depend
      on the minimum one rpciod thread to be able to handle everything.  So it
      must not block waiting for memory.
      
      xprt_dynamic_alloc_slot can block indefinitely.  This can tie up all
      workqueue threads and NFS can deadlock.  So when called from a
      workqueue, set __GFP_NORETRY.
      
      The rdma alloc_slot already does not block.  However it sets the error
      to -EAGAIN suggesting this will trigger a sleep.  It does not.  As we
      can see in call_reserveresult(), only -ENOMEM causes a sleep.  -EAGAIN
      causes immediate retry.
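
      The distinction between the two error codes can be sketched as a small
      userspace C dispatch. Only the -ENOMEM (sleep) and -EAGAIN (immediate
      retry) cases come from the text above; the `reserve_action` enum, the
      `reserve_result_action()` helper, and the handling of 0 and of other
      errors are assumptions made for illustration, not the SUNRPC code.

```c
#include <errno.h>

/* Hypothetical outcomes for a slot reservation attempt. */
enum reserve_action {
	RESERVE_OK,               /* slot obtained, carry on */
	RESERVE_RETRY_NOW,        /* retry immediately, no sleep */
	RESERVE_SLEEP_THEN_RETRY, /* back off before retrying */
	RESERVE_FAIL              /* assumed: give up on other errors */
};

/* Map a reservation status to the action the caller takes. */
enum reserve_action reserve_result_action(int status)
{
	switch (status) {
	case 0:
		return RESERVE_OK;
	case -EAGAIN:
		return RESERVE_RETRY_NOW;
	case -ENOMEM:
		return RESERVE_SLEEP_THEN_RETRY;
	default:
		return RESERVE_FAIL;
	}
}
```

      Under this mapping, an allocation failure reported as -EAGAIN would spin
      through immediate retries, which is why the choice of error code matters
      when memory is short.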
      
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      20700aa0
    • NeilBrown's avatar
      SUNRPC/call_alloc: async tasks mustn't block waiting for memory · a19fd1d6
      NeilBrown authored
      [ Upstream commit c487216b ]
      
      When memory is short, new worker threads cannot be created and we depend
      on the minimum one rpciod thread to be able to handle everything.
      So it must not block waiting for memory.
      
      mempools are particularly a problem as memory can only be released back
      to the mempool by an async rpc task running.  If all available
      workqueue threads are waiting on the mempool, no thread is available to
      return anything.
      
      rpc_malloc() can block, and this might cause deadlocks.
      So check RPC_IS_ASYNC(), rather than RPC_IS_SWAPPER() to determine if
      blocking is acceptable.
      
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a19fd1d6
    • Maxime Ripard's avatar
      clk: Enforce that disjoints limits are invalid · b07387c4
      Maxime Ripard authored
      [ Upstream commit 10c46f2e ]
      
      If we were to have two users of the same clock, doing something like:
      
      clk_set_rate_range(user1, 1000, 2000);
      clk_set_rate_range(user2, 3000, 4000);
      
      The second call would fail with -EINVAL, preventing us from getting into
      a situation where we end up with impossible limits.
      
      However, this is never explicitly checked against and enforced, and
      works by relying on an undocumented behaviour of clk_set_rate().
      
      Indeed, the first clk_set_rate_range() call will make sure the current
      clock rate is within the new range, so it will be between 1000 and
      2000Hz. The second clk_set_rate_range() call will consider (rightfully)
      that our current clock rate is outside of the 3000-4000Hz range, and
      will call clk_core_set_rate_nolock() to set it to 3000Hz.

      clk_core_set_rate_nolock() will then call clk_calc_new_rates(), which
      will eventually find that our 3000Hz rate is outside the min 3000Hz,
      max 2000Hz range and bail out; the error will propagate and we'll
      eventually return -EINVAL.
      
      This solely relies on the fact that clk_calc_new_rates(), and in
      particular clk_core_determine_round_nolock(), won't modify the new rate
      allowing the error to be reported. That assumption won't be true for all
      drivers, and most importantly we'll break that assumption in a later
      patch.
      
      It can also be argued that we shouldn't even reach the point where we're
      calling clk_core_set_rate_nolock().
      
      Let's make an explicit check for disjoint ranges before we do
      anything.
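
      A minimal userspace sketch of such a check, assuming each user
      contributes one [min, max] request and the effective range is the
      intersection of all requests. The `rate_range` type and the
      `ranges_compatible()` helper are hypothetical names for illustration,
      not the clk framework's API:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-user rate request, in Hz. */
struct rate_range {
	unsigned long min;
	unsigned long max;
};

/*
 * Intersect all requested ranges: the effective minimum is the largest
 * requested minimum, the effective maximum is the smallest requested
 * maximum.  The requests are disjoint when the result has min > max.
 */
bool ranges_compatible(const struct rate_range *r, size_t n)
{
	unsigned long min = 0, max = ~0UL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (r[i].min > min)
			min = r[i].min;
		if (r[i].max < max)
			max = r[i].max;
	}
	return min <= max;
}
```

      For the two requests in the example, 1000-2000 and 3000-4000, the
      intersection has min 3000 and max 2000, so the helper reports the
      conflict without ever trying to change the rate.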
      
      Signed-off-by: Maxime Ripard <maxime@cerno.tech>
      Link: https://lore.kernel.org/r/20220225143534.405820-4-maxime@cerno.tech
      Signed-off-by: Stephen Boyd <sboyd@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b07387c4
    • Tony Lindgren's avatar
      clk: ti: Preserve node in ti_dt_clocks_register() · 15bfec9d
      Tony Lindgren authored
      [ Upstream commit 80864594 ]
      
      In preparation for making use of the clock-output-names, we want to
      keep node around in ti_dt_clocks_register().
      
      This change should not be needed as a fix currently.
      
      Signed-off-by: Tony Lindgren <tony@atomide.com>
      Link: https://lore.kernel.org/r/20220204071449.16762-3-tony@atomide.com
      Signed-off-by: Stephen Boyd <sboyd@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      15bfec9d
    • Dongli Zhang's avatar
      xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32 · 5c0750ca
      Dongli Zhang authored
      [ Upstream commit eed05744 ]
      
      The sched_clock() can be used very early since commit 857baa87
      ("sched/clock: Enable sched clock early"). In addition, with commit
      38669ba2 ("x86/xen/time: Output xen sched_clock time from 0"), the kdump
      kernel in a Xen HVM guest may panic at a very early stage when accessing
      &__this_cpu_read(xen_vcpu)->time, as shown below:
      
      setup_arch()
       -> init_hypervisor_platform()
           -> x86_init.hyper.init_platform = xen_hvm_guest_init()
               -> xen_hvm_init_time_ops()
                   -> xen_clocksource_read()
                       -> src = &__this_cpu_read(xen_vcpu)->time;
      
      This is because Xen HVM supports at most MAX_VIRT_CPUS=32 'vcpu_info'
      embedded inside 'shared_info' during early stage until xen_vcpu_setup() is
      used to allocate/relocate 'vcpu_info' for boot cpu at arbitrary address.
      
      However, when a Xen HVM guest panics on vcpu >= 32, since
      xen_vcpu_info_reset(0) would set per_cpu(xen_vcpu, cpu) = NULL when
      vcpu >= 32, xen_clocksource_read() on vcpu >= 32 would panic.
      
      This patch calls xen_hvm_init_time_ops() again later in
      xen_hvm_smp_prepare_boot_cpu() after the 'vcpu_info' for boot vcpu is
      registered when the boot vcpu is >= 32.
      
      This issue can be reproduced on purpose via the command below on the
      guest side when kdump/kexec is enabled:
      
      "taskset -c 33 echo c > /proc/sysrq-trigger"
      
      The bugfix for PVM is not implemented due to the lack of a testing
      environment.
      
      [boris: xen_hvm_init_time_ops() returns on errors instead of jumping to end]
      
      Cc: Joe Jin <joe.jin@oracle.com>
      Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Link: https://lore.kernel.org/r/20220302164032.14569-3-dongli.zhang@oracle.com
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      5c0750ca
    • Ohad Sharabi's avatar
      habanalabs: fix possible memory leak in MMU DR fini · 12e49aef
      Ohad Sharabi authored
      [ Upstream commit eb85eec8 ]
      
      This patch fixes what seems to be a copy-paste error.
      
      We will have a memory leak if the host-resident shadow is NULL (which
      will likely happen as the DR and HR are not dependent).
      
      Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
      Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
      Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      12e49aef
    • Trond Myklebust's avatar
      NFSv4: Protect the state recovery thread against direct reclaim · a34752aa
      Trond Myklebust authored
      [ Upstream commit 3e17898a ]
      
      If memory allocation triggers a direct reclaim from the state recovery
      thread, then we can deadlock. Use memalloc_nofs_save/restore to ensure
      that doesn't happen.
      
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a34752aa
    • Xin Xiong's avatar
      NFSv4.2: fix reference count leaks in _nfs42_proc_copy_notify() · b37f482b
      Xin Xiong authored
      [ Upstream commit b7f114ed ]
      
      The reference counting issue happens in two error paths in the
      function _nfs42_proc_copy_notify(). In both error paths, the function
      simply returns the error code and forgets to balance the refcount of
      object `ctx`, bumped by get_nfs_open_context() earlier, which may
      cause refcount leaks.
      
      Fix it by balancing refcount of the `ctx` object before the function
      returns in both error paths.
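
      The pattern of balancing a reference on every exit path can be sketched
      in userspace C. The `ctx_sketch` type and the `ctx_get()`, `ctx_put()`
      and `copy_notify_sketch()` functions are hypothetical; the sketch
      assumes the reference is only needed for the duration of the call, and
      it is not the code of this patch.

```c
#include <errno.h>

/* Hypothetical refcounted context. */
struct ctx_sketch {
	int refcount;
};

struct ctx_sketch *ctx_get(struct ctx_sketch *c)
{
	c->refcount++;
	return c;
}

void ctx_put(struct ctx_sketch *c)
{
	c->refcount--;
}

/*
 * Take a reference, run two steps that may fail, and drop the
 * reference on every exit path through a single 'out' label.
 * 'fail_step' selects which step fails (0 means none).
 */
int copy_notify_sketch(struct ctx_sketch *c, int fail_step)
{
	struct ctx_sketch *ctx = ctx_get(c);
	int status = 0;

	if (fail_step == 1) {	/* first error path */
		status = -ENOMEM;
		goto out;
	}
	if (fail_step == 2) {	/* second error path */
		status = -EIO;
		goto out;
	}
	/* ... the actual work would happen here ... */
out:
	ctx_put(ctx);
	return status;
}
```

      Returning directly from either error branch instead of jumping to `out`
      would leave `refcount` one higher per failed call, which is the leak
      described above.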
      
      Signed-off-by: Xin Xiong <xiongx18@fudan.edu.cn>
      Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
      Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b37f482b
    • Lucas Denefle's avatar
      w1: w1_therm: fixes w1_seq for ds28ea00 sensors · 24acdd5f
      Lucas Denefle authored
      [ Upstream commit 41a92a89 ]
      
      w1_seq was failing due to several devices responding to the
      CHAIN_DONE at the same time. Now properly select the current
      device in the chain with MATCH_ROM. Also, the acknowledgment was
      read twice.
      
      Signed-off-by: Lucas Denefle <lucas.denefle@converge.io>
      Link: https://lore.kernel.org/r/20220223113558.232750-1-lucas.denefle@converge.io
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      24acdd5f