Skip to content
  1. Jul 28, 2021
    • Alexander Potapenko's avatar
      kfence: skip all GFP_ZONEMASK allocations · 5040926b
      Alexander Potapenko authored
      commit 236e9f15 upstream.
      
      Allocation requests outside ZONE_NORMAL (MOVABLE, HIGHMEM or DMA) cannot
      be fulfilled by KFENCE, because KFENCE memory pool is located in a zone
      different from the requested one.
      
      Because callers of kmem_cache_alloc() may actually rely on the
      allocation to reside in the requested zone (e.g.  memory allocations
      done with __GFP_DMA must be DMAable), skip all allocations done with
      GFP_ZONEMASK and/or respective SLAB flags (SLAB_CACHE_DMA and
      SLAB_CACHE_DMA32).
      
      Link: https://lkml.kernel.org/r/20210714092222.1890268-2-glider@google.com
      Fixes: 0ce20dd8
      
       ("mm: add Kernel Electric-Fence infrastructure")
      Signed-off-by: default avatarAlexander Potapenko <glider@google.com>
      Reviewed-by: default avatarMarco Elver <elver@google.com>
      Acked-by: default avatarSouptick Joarder <jrdr.linux@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.12+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5040926b
    • Alexander Potapenko's avatar
      kfence: move the size check to the beginning of __kfence_alloc() · e9adaed2
      Alexander Potapenko authored
      commit 235a85cb upstream.
      
      Check the allocation size before toggling kfence_allocation_gate.
      
      This way allocations that can't be served by KFENCE will not result in
      waiting for another CONFIG_KFENCE_SAMPLE_INTERVAL without allocating
      anything.
      
      Link: https://lkml.kernel.org/r/20210714092222.1890268-1-glider@google.com
      
      
      Signed-off-by: default avatarAlexander Potapenko <glider@google.com>
      Suggested-by: default avatarMarco Elver <elver@google.com>
      Reviewed-by: default avatarMarco Elver <elver@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>	[5.12+]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e9adaed2
    • Peter Collingbourne's avatar
      userfaultfd: do not untag user pointers · 60e7f63d
      Peter Collingbourne authored
      commit e71e2ace upstream.
      
      Patch series "userfaultfd: do not untag user pointers", v5.
      
      If a user program uses userfaultfd on ranges of heap memory, it may end
      up passing a tagged pointer to the kernel in the range.start field of
      the UFFDIO_REGISTER ioctl.  This can happen when using an MTE-capable
      allocator, or on Android if using the Tagged Pointers feature for MTE
      readiness [1].
      
      When a fault subsequently occurs, the tag is stripped from the fault
      address returned to the application in the fault.address field of struct
      uffd_msg.  However, from the application's perspective, the tagged
      address *is* the memory address, so if the application is unaware of
      memory tags, it may get confused by receiving an address that is, from
      its point of view, outside of the bounds of the allocation.  We observed
      this behavior in the kselftest for userfaultfd [2] but other
      applications could have the same problem.
      
      Address this by not untagging pointers passed to the userfaultfd ioctls.
      Instead, let the system call fail.  Also change the kselftest to use
      mmap so that it doesn't encounter this problem.
      
      [1] https://source.android.com/devices/tech/debug/tagged-pointers
      [2] tools/testing/selftests/vm/userfaultfd.c
      
      This patch (of 2):
      
      Do not untag pointers passed to the userfaultfd ioctls.  Instead, let
      the system call fail.  This will provide an early indication of problems
      with tag-unaware userspace code instead of letting the code get confused
      later, and is consistent with how we decided to handle brk/mmap/mremap
      in commit dcde2373 ("mm: Avoid creating virtual address aliases in
      brk()/mmap()/mremap()"), as well as being consistent with the existing
      tagged address ABI documentation relating to how ioctl arguments are
      handled.
      
      The code change is a revert of commit 7d032574 ("userfaultfd: untag
      user pointers") plus some fixups to some additional calls to
      validate_range that have appeared since then.
      
      [1] https://source.android.com/devices/tech/debug/tagged-pointers
      [2] tools/testing/selftests/vm/userfaultfd.c
      
      Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
      Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
      Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
      Fixes: 63f0c603
      
       ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
      Signed-off-by: default avatarPeter Collingbourne <pcc@google.com>
      Reviewed-by: default avatarAndrey Konovalov <andreyknvl@gmail.com>
      Reviewed-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Delva <adelva@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Martin <Dave.Martin@arm.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mitch Phillips <mitchp@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: William McVicker <willmcvicker@google.com>
      Cc: <stable@vger.kernel.org>	[5.4]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      60e7f63d
    • Jens Axboe's avatar
      io_uring: fix early fdput() of file · a6ead781
      Jens Axboe authored
      commit 0cc936f7 upstream.
      
      A previous commit shuffled some code around, and inadvertently used
      struct file after fdput() had been called on it. As we can't touch
      the file post fdput() dropping our reference, move the fdput() to
      after that has been done.
      
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/io-uring/YPnqM0fY3nM5RdRI@zeniv-ca.linux.org.uk/
      Fixes: f2a48dd0
      
       ("io_uring: refactor io_sq_offload_create()")
      Reported-by: default avatarAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a6ead781
    • Pavel Begunkov's avatar
      io_uring: remove double poll entry on arm failure · 81cebade
      Pavel Begunkov authored
      commit 46fee9ab upstream.
      
      __io_queue_proc() can enqueue both poll entries and still fail
      afterwards, so the callers trying to cancel it should also try to remove
      the second poll entry (if any).
      
      For example, it may leave the request alive referencing a io_uring
      context but not accessible for cancellation:
      
      [  282.599913][ T1620] task:iou-sqp-23145   state:D stack:28720 pid:23155 ppid:  8844 flags:0x00004004
      [  282.609927][ T1620] Call Trace:
      [  282.613711][ T1620]  __schedule+0x93a/0x26f0
      [  282.634647][ T1620]  schedule+0xd3/0x270
      [  282.638874][ T1620]  io_uring_cancel_generic+0x54d/0x890
      [  282.660346][ T1620]  io_sq_thread+0xaac/0x1250
      [  282.696394][ T1620]  ret_from_fork+0x1f/0x30
      
      Cc: stable@vger.kernel.org
      Fixes: 18bceab1
      
       ("io_uring: allow POLL_ADD with double poll_wait() users")
      Reported-and-tested-by: default avatar <syzbot+ac957324022b7132accf@syzkaller.appspotmail.com>
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/0ec1228fc5eda4cb524eeda857da8efdc43c331c.1626774457.git.asml.silence@gmail.com
      
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      81cebade
    • Pavel Begunkov's avatar
      io_uring: explicitly count entries for poll reqs · 0d80ae09
      Pavel Begunkov authored
      commit 68b11e8b upstream.
      
      If __io_queue_proc() fails to add a second poll entry, e.g. kmalloc()
      failed, but it goes on with a third waitqueue, it may succeed and
      overwrite the error status. Count the number of poll entries we added,
      so we can set pt->error to zero at the beginning and find out when the
      mentioned scenario happens.
      
      Cc: stable@vger.kernel.org
      Fixes: 18bceab1
      
       ("io_uring: allow POLL_ADD with double poll_wait() users")
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/9d6b9e561f88bcc0163623b74a76c39f712151c3.1626774457.git.asml.silence@gmail.com
      
      
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0d80ae09
    • Peter Collingbourne's avatar
      selftest: use mmap instead of posix_memalign to allocate memory · 2f13b6fe
      Peter Collingbourne authored
      commit 0db282ba upstream.
      
      This test passes pointers obtained from anon_allocate_area to the
      userfaultfd and mremap APIs.  This causes a problem if the system
      allocator returns tagged pointers because with the tagged address ABI
      the kernel rejects tagged addresses passed to these APIs, which would
      end up causing the test to fail.  To make this test compatible with such
      system allocators, stop using the system allocator to allocate memory in
      anon_allocate_area, and instead just use mmap.
      
      Link: https://lkml.kernel.org/r/20210714195437.118982-3-pcc@google.com
      Link: https://linux-review.googlesource.com/id/Icac91064fcd923f77a83e8e133f8631c5b8fc241
      Fixes: c47174fc
      
       ("userfaultfd: selftest")
      Co-developed-by: default avatarLokesh Gidra <lokeshgidra@google.com>
      Signed-off-by: default avatarLokesh Gidra <lokeshgidra@google.com>
      Signed-off-by: default avatarPeter Collingbourne <pcc@google.com>
      Reviewed-by: default avatarCatalin Marinas <catalin.marinas@arm.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Dave Martin <Dave.Martin@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Alistair Delva <adelva@google.com>
      Cc: William McVicker <willmcvicker@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Mitch Phillips <mitchp@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.4]
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2f13b6fe
    • Frederic Weisbecker's avatar
      posix-cpu-timers: Fix rearm racing against process tick · fae0c4bb
      Frederic Weisbecker authored
      commit 1a3402d9
      
       upstream.
      
      Since the process wide cputime counter is started locklessly from
      posix_cpu_timer_rearm(), it can be concurrently stopped by operations
      on other timers from the same thread group, such as in the following
      unlucky scenario:
      
               CPU 0                                CPU 1
               -----                                -----
                                                 timer_settime(TIMER B)
         posix_cpu_timer_rearm(TIMER A)
             cpu_clock_sample_group()
                 (pct->timers_active already true)
      
                                                 handle_posix_cpu_timers()
                                                     check_process_timers()
                                                         stop_process_timers()
                                                             pct->timers_active = false
             arm_timer(TIMER A)
      
         tick -> run_posix_cpu_timers()
             // sees !pct->timers_active, ignore
             // our TIMER A
      
      Fix this with simply locking process wide cputime counting start and
      timer arm in the same block.
      
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarFrederic Weisbecker <frederic@kernel.org>
      Fixes: 60f2ceaa
      
       ("posix-cpu-timers: Remove unnecessary locking around cpu_clock_sample_group")
      Cc: stable@vger.kernel.org
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fae0c4bb
    • Loic Poulain's avatar
      bus: mhi: pci_generic: Fix inbound IPCR channel · 52db60a9
      Loic Poulain authored
      commit b8a97f2a upstream.
      
      The qrtr-mhi client driver assumes that inbound buffers are
      automatically allocated and queued by the MHI core, but this
      doesn't happen for mhi pci devices since IPCR inbound channel is
      not flagged with auto_queue, causing unusable IPCR (qrtr)
      feature. Fix that.
      
      Link: https://lore.kernel.org/r/1625736749-24947-1-git-send-email-loic.poulain@linaro.org
      [mani: fixed a spelling mistake in commit description]
      Fixes: 855a70c1
      
       ("bus: mhi: Add MHI PCI support for WWAN modems")
      Cc: stable@vger.kernel.org #5.10
      Reviewed-by: default avatarHemant kumar <hemantk@codeaurora.org>
      Reviewed-by: default avatarManivannan Sadhasivam <mani@kernel.org>
      Signed-off-by: default avatarLoic Poulain <loic.poulain@linaro.org>
      Signed-off-by: default avatarManivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Link: https://lore.kernel.org/r/20210716075106.49938-4-manivannan.sadhasivam@linaro.org
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      52db60a9
    • Bhaumik Bhatt's avatar
      bus: mhi: core: Validate channel ID when processing command completions · aed4f5b5
      Bhaumik Bhatt authored
      commit 546362a9 upstream.
      
      MHI reads the channel ID from the event ring element sent by the
      device which can be any value between 0 and 255. In order to
      prevent any out of bound accesses, add a check against the maximum
      number of channels supported by the controller and those channels
      not configured yet so as to skip processing of that event ring
      element.
      
      Link: https://lore.kernel.org/r/1624558141-11045-1-git-send-email-bbhatt@codeaurora.org
      Fixes: 1d3173a3
      
       ("bus: mhi: core: Add support for processing events from client device")
      Cc: stable@vger.kernel.org #5.10
      Reviewed-by: default avatarHemant Kumar <hemantk@codeaurora.org>
      Reviewed-by: default avatarManivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Reviewed-by: default avatarJeffrey Hugo <quic_jhugo@quicinc.com>
      Signed-off-by: default avatarBhaumik Bhatt <bbhatt@codeaurora.org>
      Signed-off-by: default avatarManivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Link: https://lore.kernel.org/r/20210716075106.49938-3-manivannan.sadhasivam@linaro.org
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      aed4f5b5
    • Bhaumik Bhatt's avatar
      bus: mhi: pci_generic: Apply no-op for wake using sideband wake boolean · a8827068
      Bhaumik Bhatt authored
      commit 56f6f4c4 upstream.
      
      Devices such as SDX24 do not have the provision for inband wake
      doorbell in the form of channel 127 and instead have a sideband
      GPIO for it. Newer devices such as SDX55 or SDX65 support inband
      wake method by default. Ensure the functionality is used based on
      this such that device wake stays held when a client driver uses
      mhi_device_get() API or the equivalent debugfs entry.
      
      Link: https://lore.kernel.org/r/1624560809-30610-1-git-send-email-bbhatt@codeaurora.org
      Fixes: e3e5e650
      
       ("bus: mhi: pci_generic: No-Op for device_wake operations")
      Cc: stable@vger.kernel.org #5.12
      Reviewed-by: default avatarManivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Signed-off-by: default avatarBhaumik Bhatt <bbhatt@codeaurora.org>
      Signed-off-by: default avatarManivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Link: https://lore.kernel.org/r/20210716075106.49938-2-manivannan.sadhasivam@linaro.org
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a8827068
    • Peter Ujfalusi's avatar
      driver core: auxiliary bus: Fix memory leak when driver_register() fail · ce5b3de5
      Peter Ujfalusi authored
      commit 4afa0c22 upstream.
      
      If driver_register() returns with error we need to free the memory
      allocated for auxdrv->driver.name before returning from
      __auxiliary_driver_register()
      
      Fixes: 7de3697e
      
       ("Add auxiliary bus support")
      Reviewed-by: default avatarDan Williams <dan.j.williams@intel.com>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: default avatarPeter Ujfalusi <peter.ujfalusi@linux.intel.com>
      Link: https://lore.kernel.org/r/20210713093438.3173-1-peter.ujfalusi@linux.intel.com
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ce5b3de5
    • Markus Boehme's avatar
      ixgbe: Fix packet corruption due to missing DMA sync · 423123e4
      Markus Boehme authored
      commit 09cfae9f upstream.
      
      When receiving a packet with multiple fragments, hardware may still
      touch the first fragment until the entire packet has been received. The
      driver therefore keeps the first fragment mapped for DMA until end of
      packet has been asserted, and delays its dma_sync call until then.
      
      The driver tries to fit multiple receive buffers on one page. When using
      3K receive buffers (e.g. using Jumbo frames and legacy-rx is turned
      off/build_skb is being used) on an architecture with 4K pages, the
      driver allocates an order 1 compound page and uses one page per receive
      buffer. To determine the correct offset for a delayed DMA sync of the
      first fragment of a multi-fragment packet, the driver then cannot just
      use PAGE_MASK on the DMA address but has to construct a mask based on
      the actual size of the backing page.
      
      Using PAGE_MASK in the 3K RX buffer/4K page architecture configuration
      will always sync the first page of a compound page. With the SWIOTLB
      enabled this can lead to corrupted packets (zeroed out first fragment,
      re-used garbage from another packet) and various consequences, such as
      slow/stalling data transfers and connection resets. For example, testing
      on a link with MTU exceeding 3058 bytes on a host with SWIOTLB enabled
      (e.g. "iommu=soft swiotlb=262144,force") TCP transfers quickly fizzle
      out without this patch.
      
      Cc: stable@vger.kernel.org
      Fixes: 0c5661ec
      
       ("ixgbe: fix crash in build_skb Rx code path")
      Signed-off-by: default avatarMarkus Boehme <markubo@amazon.com>
      Tested-by: default avatarTony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: default avatarTony Nguyen <anthony.l.nguyen@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      423123e4
    • Gustavo A. R. Silva's avatar
      media: ngene: Fix out-of-bounds bug in ngene_command_config_free_buf() · b9a178f1
      Gustavo A. R. Silva authored
      commit 8d4abca9 upstream.
      
      Fix an 11-year old bug in ngene_command_config_free_buf() while
      addressing the following warnings caught with -Warray-bounds:
      
      arch/alpha/include/asm/string.h:22:16: warning: '__builtin_memcpy' offset [12, 16] from the object at 'com' is out of the bounds of referenced subobject 'config' with type 'unsigned char' at offset 10 [-Warray-bounds]
      arch/x86/include/asm/string_32.h:182:25: warning: '__builtin_memcpy' offset [12, 16] from the object at 'com' is out of the bounds of referenced subobject 'config' with type 'unsigned char' at offset 10 [-Warray-bounds]
      
      The problem is that the original code is trying to copy 6 bytes of
      data into a one-byte size member _config_ of the wrong structue
      FW_CONFIGURE_BUFFERS, in a single call to memcpy(). This causes a
      legitimate compiler warning because memcpy() overruns the length
      of &com.cmd.ConfigureBuffers.config. It seems that the right
      structure is FW_CONFIGURE_FREE_BUFFERS, instead, because it contains
      6 more members apart from the header _hdr_. Also, the name of
      the function ngene_command_config_free_buf() suggests that the actual
      intention is to ConfigureFreeBuffers, instead of ConfigureBuffers
      (which takes place in the function ngene_command_config_buf(), above).
      
      Fix this by enclosing those 6 members of struct FW_CONFIGURE_FREE_BUFFERS
      into new struct config, and use &com.cmd.ConfigureFreeBuffers.config as
      the destination address, instead of &com.cmd.ConfigureBuffers.config,
      when calling memcpy().
      
      This also helps with the ongoing efforts to globally enable
      -Warray-bounds and get us closer to being able to tighten the
      FORTIFY_SOURCE routines on memcpy().
      
      Link: https://github.com/KSPP/linux/issues/109
      Fixes: dae52d00
      
       ("V4L/DVB: ngene: Initial check-in")
      Cc: stable@vger.kernel.org
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Reviewed-by: default avatarKees Cook <keescook@chromium.org>
      Signed-off-by: default avatarGustavo A. R. Silva <gustavoars@kernel.org>
      Link: https://lore.kernel.org/linux-hardening/20210420001631.GA45456@embeddedor/
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b9a178f1
    • Filipe Manana's avatar
      btrfs: fix lock inversion problem when doing qgroup extent tracing · f5ef2fe0
      Filipe Manana authored
      commit 8949b9a1
      
       upstream.
      
      At btrfs_qgroup_trace_extent_post() we call btrfs_find_all_roots() with a
      NULL value as the transaction handle argument, which makes that function
      take the commit_root_sem semaphore, which is necessary when we don't hold
      a transaction handle or any other mechanism to prevent a transaction
      commit from wiping out commit roots.
      
      However btrfs_qgroup_trace_extent_post() can be called in a context where
      we are holding a write lock on an extent buffer from a subvolume tree,
      namely from btrfs_truncate_inode_items(), called either during truncate
      or unlink operations. In this case we end up with a lock inversion problem
      because the commit_root_sem is a higher level lock, always supposed to be
      acquired before locking any extent buffer.
      
      Lockdep detects this lock inversion problem since we switched the extent
      buffer locks from custom locks to semaphores, and when running btrfs/158
      from fstests, it reported the following trace:
      
      [ 9057.626435] ======================================================
      [ 9057.627541] WARNING: possible circular locking dependency detected
      [ 9057.628334] 5.14.0-rc2-btrfs-next-93 #1 Not tainted
      [ 9057.628961] ------------------------------------------------------
      [ 9057.629867] kworker/u16:4/30781 is trying to acquire lock:
      [ 9057.630824] ffff8e2590f58760 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.632542]
                     but task is already holding lock:
      [ 9057.633551] ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs]
      [ 9057.635255]
                     which lock already depends on the new lock.
      
      [ 9057.636292]
                     the existing dependency chain (in reverse order) is:
      [ 9057.637240]
                     -> #1 (&fs_info->commit_root_sem){++++}-{3:3}:
      [ 9057.638138]        down_read+0x46/0x140
      [ 9057.638648]        btrfs_find_all_roots+0x41/0x80 [btrfs]
      [ 9057.639398]        btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
      [ 9057.640283]        btrfs_add_delayed_data_ref+0x418/0x490 [btrfs]
      [ 9057.641114]        btrfs_free_extent+0x35/0xb0 [btrfs]
      [ 9057.641819]        btrfs_truncate_inode_items+0x424/0xf70 [btrfs]
      [ 9057.642643]        btrfs_evict_inode+0x454/0x4f0 [btrfs]
      [ 9057.643418]        evict+0xcf/0x1d0
      [ 9057.643895]        do_unlinkat+0x1e9/0x300
      [ 9057.644525]        do_syscall_64+0x3b/0xc0
      [ 9057.645110]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 9057.645835]
                     -> #0 (btrfs-tree-00){++++}-{3:3}:
      [ 9057.646600]        __lock_acquire+0x130e/0x2210
      [ 9057.647248]        lock_acquire+0xd7/0x310
      [ 9057.647773]        down_read_nested+0x4b/0x140
      [ 9057.648350]        __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.649175]        btrfs_read_lock_root_node+0x31/0x40 [btrfs]
      [ 9057.650010]        btrfs_search_slot+0x537/0xc00 [btrfs]
      [ 9057.650849]        scrub_print_warning_inode+0x89/0x370 [btrfs]
      [ 9057.651733]        iterate_extent_inodes+0x1e3/0x280 [btrfs]
      [ 9057.652501]        scrub_print_warning+0x15d/0x2f0 [btrfs]
      [ 9057.653264]        scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs]
      [ 9057.654295]        scrub_bio_end_io_worker+0x101/0x2e0 [btrfs]
      [ 9057.655111]        btrfs_work_helper+0xf8/0x400 [btrfs]
      [ 9057.655831]        process_one_work+0x247/0x5a0
      [ 9057.656425]        worker_thread+0x55/0x3c0
      [ 9057.656993]        kthread+0x155/0x180
      [ 9057.657494]        ret_from_fork+0x22/0x30
      [ 9057.658030]
                     other info that might help us debug this:
      
      [ 9057.659064]  Possible unsafe locking scenario:
      
      [ 9057.659824]        CPU0                    CPU1
      [ 9057.660402]        ----                    ----
      [ 9057.660988]   lock(&fs_info->commit_root_sem);
      [ 9057.661581]                                lock(btrfs-tree-00);
      [ 9057.662348]                                lock(&fs_info->commit_root_sem);
      [ 9057.663254]   lock(btrfs-tree-00);
      [ 9057.663690]
                      *** DEADLOCK ***
      
      [ 9057.664437] 4 locks held by kworker/u16:4/30781:
      [ 9057.665023]  #0: ffff8e25922a1148 ((wq_completion)btrfs-scrub){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0
      [ 9057.666260]  #1: ffffabb3451ffe70 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0
      [ 9057.667639]  #2: ffff8e25922da198 (&ret->mutex){+.+.}-{3:3}, at: scrub_handle_errored_block.isra.0+0x5d2/0x1640 [btrfs]
      [ 9057.669017]  #3: ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs]
      [ 9057.670408]
                     stack backtrace:
      [ 9057.670976] CPU: 7 PID: 30781 Comm: kworker/u16:4 Not tainted 5.14.0-rc2-btrfs-next-93 #1
      [ 9057.672030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [ 9057.673492] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
      [ 9057.674258] Call Trace:
      [ 9057.674588]  dump_stack_lvl+0x57/0x72
      [ 9057.675083]  check_noncircular+0xf3/0x110
      [ 9057.675611]  __lock_acquire+0x130e/0x2210
      [ 9057.676132]  lock_acquire+0xd7/0x310
      [ 9057.676605]  ? __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.677313]  ? lock_is_held_type+0xe8/0x140
      [ 9057.677849]  down_read_nested+0x4b/0x140
      [ 9057.678349]  ? __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.679068]  __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.679760]  btrfs_read_lock_root_node+0x31/0x40 [btrfs]
      [ 9057.680458]  btrfs_search_slot+0x537/0xc00 [btrfs]
      [ 9057.681083]  ? _raw_spin_unlock+0x29/0x40
      [ 9057.681594]  ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs]
      [ 9057.682336]  scrub_print_warning_inode+0x89/0x370 [btrfs]
      [ 9057.683058]  ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs]
      [ 9057.683834]  ? scrub_write_block_to_dev_replace+0xb0/0xb0 [btrfs]
      [ 9057.684632]  iterate_extent_inodes+0x1e3/0x280 [btrfs]
      [ 9057.685316]  scrub_print_warning+0x15d/0x2f0 [btrfs]
      [ 9057.685977]  ? ___ratelimit+0xa4/0x110
      [ 9057.686460]  scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs]
      [ 9057.687316]  scrub_bio_end_io_worker+0x101/0x2e0 [btrfs]
      [ 9057.688021]  btrfs_work_helper+0xf8/0x400 [btrfs]
      [ 9057.688649]  ? lock_is_held_type+0xe8/0x140
      [ 9057.689180]  process_one_work+0x247/0x5a0
      [ 9057.689696]  worker_thread+0x55/0x3c0
      [ 9057.690175]  ? process_one_work+0x5a0/0x5a0
      [ 9057.690731]  kthread+0x155/0x180
      [ 9057.691158]  ? set_kthread_struct+0x40/0x40
      [ 9057.691697]  ret_from_fork+0x22/0x30
      
      Fix this by making btrfs_find_all_roots() never attempt to lock the
      commit_root_sem when it is called from btrfs_qgroup_trace_extent_post().
      
      We can't just pass a non-NULL transaction handle to btrfs_find_all_roots()
      from btrfs_qgroup_trace_extent_post(), because that would make backref
      lookup not use commit roots and acquire read locks on extent buffers, and
      therefore could deadlock when btrfs_qgroup_trace_extent_post() is called
      from the btrfs_truncate_inode_items() code path which has acquired a write
      lock on an extent buffer of the subvolume btree.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f5ef2fe0
    • Filipe Manana's avatar
      btrfs: fix unpersisted i_size on fsync after expanding truncate · 6f919907
      Filipe Manana authored
      commit 9acc8103 upstream.
      
      If we have an inode that does not have the full sync flag set, was changed
      in the current transaction, then it is logged while logging some other
      inode (like its parent directory for example), its i_size is increased by
      a truncate operation, the log is synced through an fsync of some other
      inode and then finally we explicitly call fsync on our inode, the new
      i_size is not persisted.
      
      The following example shows how to trigger it, with comments explaining
      how and why the issue happens:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ touch /mnt/foo
        $ xfs_io -f -c "pwrite -S 0xab 0 1M" /mnt/bar
      
        $ sync
      
        # Fsync bar, this will be a noop since the file has not yet been
        # modified in the current transaction. The goal here is to clear
        # BTRFS_INODE_NEEDS_FULL_SYNC from the inode's runtime flags.
        $ xfs_io -c "fsync" /mnt/bar
      
        # Now rename both files, without changing their parent directory.
        $ mv /mnt/bar /mnt/bar2
        $ mv /mnt/foo /mnt/foo2
      
        # Increase the size of bar2 with a truncate operation.
        $ xfs_io -c "truncate 2M" /mnt/bar2
      
        # Now fsync foo2, this results in logging its parent inode (the root
        # directory), and logging the parent results in logging the inode of
        # file bar2 (its inode item and the new name). The inode of file bar2
        # is logged with an i_size of 0 bytes since it's logged in
        # LOG_INODE_EXISTS mode, meaning we are only logging its names (and
        # xattrs if it had any) and the i_size of the inode will not be changed
        # when the log is replayed.
        $ xfs_io -c "fsync" /mnt/foo2
      
        # Now explicitly fsync bar2. This resulted in doing nothing, not
        # logging the inode with the new i_size of 2M and the hole from file
        # offset 1M to 2M. Because the inode did not have the flag
        # BTRFS_INODE_NEEDS_FULL_SYNC set, when it was logged through the
        # fsync of file foo2, its last_log_commit field was updated,
        # resulting in this explicit of file bar2 not doing anything.
        $ xfs_io -c "fsync" /mnt/bar2
      
        # File bar2 content and size before a power failure.
        $ od -A d -t x1 /mnt/bar2
        0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
        *
        1048576 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        2097152
      
        <power failure>
      
        # Mount the filesystem to replay the log.
        $ mount /dev/sdc /mnt
      
        # Read the file again, should have the same content and size as before
        # the power failure happened, but it doesn't, i_size is still at 1M.
        $ od -A d -t x1 /mnt/bar2
        0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
        *
        1048576
      
      This started to happen after commit 209ecbb8 ("btrfs: remove stale
      comment and logic from btrfs_inode_in_log()"), since btrfs_inode_in_log()
      no longer checks if the inode's list of modified extents is not empty.
      However, checking that list is not the right way to address this case
      and the check was added long time ago in commit 125c4cf9
      
      
      ("Btrfs: set inode's logged_trans/last_log_commit after ranged fsync")
      for a different purpose, to address consecutive ranged fsyncs.
      
      The reason that checking for the list emptiness makes this test pass is
      because during an expanding truncate we create an extent map to represent
      a hole from the old i_size to the new i_size, and add that extent map to
      the list of modified extents in the inode. However if we are low on
      available memory and we can not allocate a new extent map, then we don't
      treat it as an error and just set the full sync flag on the inode, so that
      the next fsync does not rely on the list of modified extents - so checking
      for the emptiness of the list to decide if the inode needs to be logged is
      not reliable, and results in not logging the inode if it was not possible
      to allocate the extent map for the hole.
      
      Fix this by ensuring that if we are only logging that an inode exists
      (inode item, names/references and xattrs), we don't update the inode's
      last_log_commit even if it does not have the full sync runtime flag set.
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 5.13+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6f919907
    • Anand Jain's avatar
      btrfs: check for missing device in btrfs_trim_fs · a02b5448
      Anand Jain authored
      commit 16a200f6
      
       upstream.
      
      A fstrim on a degraded raid1 can trigger the following null pointer
      dereference:
      
        BTRFS info (device loop0): allowing degraded mounts
        BTRFS info (device loop0): disk space caching is enabled
        BTRFS info (device loop0): has skinny extents
        BTRFS warning (device loop0): devid 2 uuid 97ac16f7-e14d-4db1-95bc-3d489b424adb is missing
        BTRFS warning (device loop0): devid 2 uuid 97ac16f7-e14d-4db1-95bc-3d489b424adb is missing
        BTRFS info (device loop0): enabling ssd optimizations
        BUG: kernel NULL pointer dereference, address: 0000000000000620
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP NOPTI
        CPU: 0 PID: 4574 Comm: fstrim Not tainted 5.13.0-rc7+ #31
        Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
        RIP: 0010:btrfs_trim_fs+0x199/0x4a0 [btrfs]
        RSP: 0018:ffff959541797d28 EFLAGS: 00010293
        RAX: 0000000000000000 RBX: ffff946f84eca508 RCX: a7a67937adff8608
        RDX: ffff946e8122d000 RSI: 0000000000000000 RDI: ffffffffc02fdbf0
        RBP: ffff946ea4615000 R08: 0000000000000001 R09: 0000000000000000
        R10: 0000000000000000 R11: ffff946e8122d960 R12: 0000000000000000
        R13: ffff959541797db8 R14: ffff946e8122d000 R15: ffff959541797db8
        FS:  00007f55917a5080(0000) GS:ffff946f9bc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000620 CR3: 000000002d2c8001 CR4: 00000000000706f0
        Call Trace:
        btrfs_ioctl_fitrim+0x167/0x260 [btrfs]
        btrfs_ioctl+0x1c00/0x2fe0 [btrfs]
        ? selinux_file_ioctl+0x140/0x240
        ? syscall_trace_enter.constprop.0+0x188/0x240
        ? __x64_sys_ioctl+0x83/0xb0
        __x64_sys_ioctl+0x83/0xb0
      
      Reproducer:
      
        $ mkfs.btrfs -fq -d raid1 -m raid1 /dev/loop0 /dev/loop1
        $ mount /dev/loop0 /btrfs
        $ umount /btrfs
        $ btrfs dev scan --forget
        $ mount -o degraded /dev/loop0 /btrfs
      
        $ fstrim /btrfs
      
      The reason is we call btrfs_trim_free_extents() for the missing device,
      which uses device->bdev (NULL for missing device) to find if the device
      supports discard.
      
      Fix is to check if the device is missing before calling
      btrfs_trim_free_extents().
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a02b5448
    • Steven Rostedt (VMware)'s avatar
      tracing: Synthetic event field_pos is an index not a boolean · 020d8cea
      Steven Rostedt (VMware) authored
      commit 3b13911a upstream.
      
      Performing the following:
      
       ># echo 'wakeup_lat s32 pid; u64 delta; char wake_comm[]' > synthetic_events
       ># echo 'hist:keys=pid:__arg__1=common_timestamp.usecs' > events/sched/sched_waking/trigger
       ># echo 'hist:keys=next_pid:pid=next_pid,delta=common_timestamp.usecs-$__arg__1:onmatch(sched.sched_waking).trace(wakeup_lat,$pid,$delta,prev_comm)'\
            > events/sched/sched_switch/trigger
       ># echo 1 > events/synthetic/enable
      
      Crashed the kernel:
      
       BUG: kernel NULL pointer dereference, address: 000000000000001b
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP
       CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.13.0-rc5-test+ #104
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
       RIP: 0010:strlen+0x0/0x20
       Code: f6 82 80 2b 0b bc 20 74 11 0f b6 50 01 48 83 c0 01 f6 82 80 2b 0b bc
        20 75 ef c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 <80> 3f 00 74 10
        48 89 f8 48 83 c0 01 80 38 9 f8 c3 31
       RSP: 0018:ffffaa75000d79d0 EFLAGS: 00010046
       RAX: 0000000000000002 RBX: ffff9cdb55575270 RCX: 0000000000000000
       RDX: ffff9cdb58c7a320 RSI: ffffaa75000d7b40 RDI: 000000000000001b
       RBP: ffffaa75000d7b40 R08: ffff9cdb40a4f010 R09: ffffaa75000d7ab8
       R10: ffff9cdb4398c700 R11: 0000000000000008 R12: ffff9cdb58c7a320
       R13: ffff9cdb55575270 R14: ffff9cdb58c7a000 R15: 0000000000000018
       FS:  0000000000000000(0000) GS:ffff9cdb5aa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000000000000001b CR3: 00000000c0612006 CR4: 00000000001706e0
       Call Trace:
        trace_event_raw_event_synth+0x90/0x1d0
        action_trace+0x5b/0x70
        event_hist_trigger+0x4bd/0x4e0
        ? cpumask_next_and+0x20/0x30
        ? update_sd_lb_stats.constprop.0+0xf6/0x840
        ? __lock_acquire.constprop.0+0x125/0x550
        ? find_held_lock+0x32/0x90
        ? sched_clock_cpu+0xe/0xd0
        ? lock_release+0x155/0x440
        ? update_load_avg+0x8c/0x6f0
        ? enqueue_entity+0x18a/0x920
        ? __rb_reserve_next+0xe5/0x460
        ? ring_buffer_lock_reserve+0x12a/0x3f0
        event_triggers_call+0x52/0xe0
        trace_event_buffer_commit+0x1ae/0x240
        trace_event_raw_event_sched_switch+0x114/0x170
        __traceiter_sched_switch+0x39/0x50
        __schedule+0x431/0xb00
        schedule_idle+0x28/0x40
        do_idle+0x198/0x2e0
        cpu_startup_entry+0x19/0x20
        secondary_startup_64_no_verify+0xc2/0xcb
      
      The reason is that the dynamic events array keeps track of the field
      position of the fields array, via the field_pos variable in the
      synth_field structure. Unfortunately, that field is a boolean for some
      reason, which means any field_pos greater than 1 will be a bug (in this
      case it was 2).
      
      Link: https://lkml.kernel.org/r/20210721191008.638bce34@oasis.local.home
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Fixes: bd82631d
      
       ("tracing: Add support for dynamic strings to synthetic events")
      Reviewed-by: default avatarTom Zanussi <zanussi@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      020d8cea
    • Haoran Luo's avatar
      tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop. · 917a5bdd
      Haoran Luo authored
      commit 67f0d6d9 upstream.
      
      The "rb_per_cpu_empty()" misinterpret the condition (as not-empty) when
      "head_page" and "commit_page" of "struct ring_buffer_per_cpu" points to
      the same buffer page, whose "buffer_data_page" is empty and "read" field
      is non-zero.
      
      An error scenario could be constructed as followed (kernel perspective):
      
      1. All pages in the buffer has been accessed by reader(s) so that all of
      them will have non-zero "read" field.
      
      2. Read and clear all buffer pages so that "rb_num_of_entries()" will
      return 0 rendering there's no more data to read. It is also required
      that the "read_page", "commit_page" and "tail_page" points to the same
      page, while "head_page" is the next page of them.
      
      3. Invoke "ring_buffer_lock_reserve()" with large enough "length"
      so that it shot pass the end of current tail buffer page. Now the
      "head_page", "commit_page" and "tail_page" points to the same page.
      
      4. Discard current event with "ring_buffer_discard_commit()", so that
      "head_page", "commit_page" and "tail_page" points to a page whose buffer
      data page is now empty.
      
      When the error scenario has been constructed, "tracing_read_pipe" will
      be trapped inside a deadloop: "trace_empty()" returns 0 since
      "rb_per_cpu_empty()" returns 0 when it hits the CPU containing such
      constructed ring buffer. Then "trace_find_next_entry_inc()" always
      return NULL since "rb_num_of_entries()" reports there's no more entry
      to read. Finally "trace_seq_to_user()" returns "-EBUSY" spanking
      "tracing_read_pipe" back to the start of the "waitagain" loop.
      
      I've also written a proof-of-concept script to construct the scenario
      and trigger the bug automatically, you can use it to trace and validate
      my reasoning above:
      
        https://github.com/aegistudio/RingBufferDetonator.git
      
      Tests has been carried out on linux kernel 5.14-rc2
      (2734d6c1), my fixed version
      of kernel (for testing whether my update fixes the bug) and
      some older kernels (for range of affected kernels). Test result is
      also attached to the proof-of-concept repository.
      
      Link: https://lore.kernel.org/linux-trace-devel/YPaNxsIlb2yjSi5Y@aegistudio/
      Link: https://lore.kernel.org/linux-trace-devel/YPgrN85WL9VyrZ55@aegistudio
      
      Cc: stable@vger.kernel.org
      Fixes: bf41a158
      
       ("ring-buffer: make reentrant")
      Suggested-by: default avatarLinus Torvalds <torvalds@linuxfoundation.org>
      Signed-off-by: default avatarHaoran Luo <www@aegistudio.net>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      917a5bdd
    • Steven Rostedt (VMware)'s avatar
      tracing/histogram: Rename "cpu" to "common_cpu" · 29ecaddb
      Steven Rostedt (VMware) authored
      commit 1e3bac71 upstream.
      
      Currently the histogram logic allows the user to write "cpu" in as an
      event field, and it will record the CPU that the event happened on.
      
      The problem with this is that there's a lot of events that have "cpu"
      as a real field, and using "cpu" as the CPU it ran on, makes it
      impossible to run histograms on the "cpu" field of events.
      
      For example, if I want to have a histogram on the count of the
      workqueue_queue_work event on its cpu field, running:
      
       ># echo 'hist:keys=cpu' > events/workqueue/workqueue_queue_work/trigger
      
      Gives a misleading and wrong result.
      
      Change the command to "common_cpu" as no event should have "common_*"
      fields as that's a reserved name for fields used by all events. And
      this makes sense here as common_cpu would be a field used by all events.
      
      Now we can even do:
      
       ># echo 'hist:keys=common_cpu,cpu if cpu < 100' > events/workqueue/workqueue_queue_work/trigger
       ># cat events/workqueue/workqueue_queue_work/hist
       # event histogram
       #
       # trigger info: hist:keys=common_cpu,cpu:vals=hitcount:sort=hitcount:size=2048 if cpu < 100 [active]
       #
      
       { common_cpu:          0, cpu:          2 } hitcount:          1
       { common_cpu:          0, cpu:          4 } hitcount:          1
       { common_cpu:          7, cpu:          7 } hitcount:          1
       { common_cpu:          0, cpu:          7 } hitcount:          1
       { common_cpu:          0, cpu:          1 } hitcount:          1
       { common_cpu:          0, cpu:          6 } hitcount:          2
       { common_cpu:          0, cpu:          5 } hitcount:          2
       { common_cpu:          1, cpu:          1 } hitcount:          4
       { common_cpu:          6, cpu:          6 } hitcount:          4
       { common_cpu:          5, cpu:          5 } hitcount:         14
       { common_cpu:          4, cpu:          4 } hitcount:         26
       { common_cpu:          0, cpu:          0 } hitcount:         39
       { common_cpu:          2, cpu:          2 } hitcount:        184
      
      Now for backward compatibility, I added a trick. If "cpu" is used, and
      the field is not found, it will fall back to "common_cpu" and work as
      it did before. This way, it will still work for old programs that use
      "cpu" to get the actual CPU, but if the event has a "cpu" as a field, it
      will get that event's "cpu" field, which is probably what it wants
      anyway.
      
      I updated the tracefs/README to include documentation about both the
      common_timestamp and the common_cpu. This way, if that text is present in
      the README, then an application can know that common_cpu is supported over
      just plain "cpu".
      
      Link: https://lkml.kernel.org/r/20210721110053.26b4f641@oasis.local.home
      
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Fixes: 8b7622bf
      
       ("tracing: Add cpu field for hist triggers")
      Reviewed-by: default avatarTom Zanussi <zanussi@kernel.org>
      Reviewed-by: default avatarMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      29ecaddb
    • Steven Rostedt (VMware)'s avatar
      tracepoints: Update static_call before tp_funcs when adding a tracepoint · 58f47cfe
      Steven Rostedt (VMware) authored
      commit 352384d5 upstream.
      
      Because of the significant overhead that retpolines pose on indirect
      calls, the tracepoint code was updated to use the new "static_calls" that
      can modify the running code to directly call a function instead of using
      an indirect caller, and this function can be changed at runtime.
      
      In the tracepoint code that calls all the registered callbacks that are
      attached to a tracepoint, the following is done:
      
      	it_func_ptr = rcu_dereference_raw((&__tracepoint_##name)->funcs);
      	if (it_func_ptr) {
      		__data = (it_func_ptr)->data;
      		static_call(tp_func_##name)(__data, args);
      	}
      
      If there's just a single callback, the static_call is updated to just call
      that callback directly. Once another handler is added, then the static
      caller is updated to call the iterator, that simply loops over all the
      funcs in the array and calls each of the callbacks like the old method
      using indirect calling.
      
      The issue was discovered with a race between updating the funcs array and
      updating the static_call. The funcs array was updated first and then the
      static_call was updated. This is not an issue as long as the first element
      in the old array is the same as the first element in the new array. But
      that assumption is incorrect, because callbacks also have a priority
      field, and if there's a callback added that has a higher priority than the
      callback on the old array, then it will become the first callback in the
      new array. This means that it is possible to call the old callback with
      the new callback data element, which can cause a kernel panic.
      
      	static_call = callback1()
      	funcs[] = {callback1,data1};
      	callback2 has higher priority than callback1
      
      	CPU 1				CPU 2
      	-----				-----
      
         new_funcs = {callback2,data2},
                     {callback1,data1}
      
         rcu_assign_pointer(tp->funcs, new_funcs);
      
        /*
         * Now tp->funcs has the new array
         * but the static_call still calls callback1
         */
      
      				it_func_ptr = tp->funcs [ new_funcs ]
      				data = it_func_ptr->data [ data2 ]
      				static_call(callback1, data);
      
      				/* Now callback1 is called with
      				 * callback2's data */
      
      				[ KERNEL PANIC ]
      
         update_static_call(iterator);
      
      To prevent this from happening, always switch the static_call to the
      iterator before assigning the tp->funcs to the new array. The iterator will
      always properly match the callback with its data.
      
      To trigger this bug:
      
        In one terminal:
      
          while :; do hackbench 50; done
      
        In another terminal
      
          echo 1 > /sys/kernel/tracing/events/sched/sched_waking/enable
          while :; do
              echo 1 > /sys/kernel/tracing/set_event_pid;
              sleep 0.5
              echo 0 > /sys/kernel/tracing/set_event_pid;
              sleep 0.5
         done
      
      And it doesn't take long to crash. This is because the set_event_pid adds
      a callback to the sched_waking tracepoint with a high priority, which will
      be called before the sched_waking trace event callback is called.
      
      Note, the removal to a single callback updates the array first, before
      changing the static_call to single callback, which is the proper order as
      the first element in the array is the same as what the static_call is
      being changed to.
      
      Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/
      
      Cc: stable@vger.kernel.org
      Fixes: d25e37d8
      
       ("tracepoint: Optimize using static_call()")
      Reported-by: default avatarStefan Metzmacher <metze@samba.org>
      tested-by: default avatarStefan Metzmacher <metze@samba.org>
      Signed-off-by: default avatarSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      58f47cfe
    • Marc Zyngier's avatar
      firmware/efi: Tell memblock about EFI iomem reservations · 0ea2fd39
      Marc Zyngier authored
      commit 2bab693a
      
       upstream.
      
      kexec_load_file() relies on the memblock infrastructure to avoid
      stamping over regions of memory that are essential to the survival
      of the system.
      
      However, nobody seems to agree how to flag these regions as reserved,
      and (for example) EFI only publishes its reservations in /proc/iomem
      for the benefit of the traditional, userspace based kexec tool.
      
      On arm64 platforms with GICv3, this can result in the payload being
      placed at the location of the LPI tables. Shock, horror!
      
      Let's augment the EFI reservation code with a memblock_reserve() call,
      protecting our dear tables from the secondary kernel invasion.
      
      Reported-by: default avatarMoritz Fischer <mdf@kernel.org>
      Tested-by: default avatarMoritz Fischer <mdf@kernel.org>
      Signed-off-by: default avatarMarc Zyngier <maz@kernel.org>
      Cc: stable@vger.kernel.org
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarArd Biesheuvel <ardb@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0ea2fd39
    • Amelie Delaunay's avatar
      usb: typec: stusb160x: Don't block probing of consumer of "connector" nodes · 68a4037d
      Amelie Delaunay authored
      commit 6b633767 upstream.
      
      Similar as with tcpm this patch lets fw_devlink know not to wait on the
      fwnode to be populated as a struct device.
      
      Without this patch, USB functionality can be broken on some previously
      supported boards.
      
      Fixes: 28ec344b
      
       ("usb: typec: tcpm: Don't block probing of consumers of "connector" nodes")
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: default avatarAmelie Delaunay <amelie.delaunay@foss.st.com>
      Link: https://lore.kernel.org/r/20210716120718.20398-3-amelie.delaunay@foss.st.com
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      68a4037d
    • Amelie Delaunay's avatar
      usb: typec: stusb160x: register role switch before interrupt registration · eeb18490
      Amelie Delaunay authored
      commit 86762ad4 upstream.
      
      During interrupt registration, attach state is checked. If attached,
      then the Type-C state is updated with typec_set_xxx functions and role
      switch is set with usb_role_switch_set_role().
      
      If the usb_role_switch parameter is error or null, the function simply
      returns 0.
      
      So, to update usb_role_switch role if a device is attached before the
      irq is registered, usb_role_switch must be registered before irq
      registration.
      
      Fixes: da0cb631
      
       ("usb: typec: add support for STUSB160x Type-C controller family")
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: default avatarAmelie Delaunay <amelie.delaunay@foss.st.com>
      Link: https://lore.kernel.org/r/20210716120718.20398-2-amelie.delaunay@foss.st.com
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      eeb18490
    • Martin Kepplinger's avatar
      usb: typec: tipd: Don't block probing of consumer of "connector" nodes · 703527bf
      Martin Kepplinger authored
      commit 57560ee9 upstream.
      
      Similar as with tcpm this patch lets fw_devlink know not to wait on the
      fwnode to be populated as a struct device.
      
      Without this patch, USB functionality can be broken on some previously
      supported boards.
      
      Fixes: 28ec344b
      
       ("usb: typec: tcpm: Don't block probing of consumers of "connector" nodes")
      Cc: stable <stable@vger.kernel.org>
      Acked-by: default avatarHeikki Krogerus <heikki.krogerus@linux.intel.com>
      Signed-off-by: default avatarMartin Kepplinger <martin.kepplinger@puri.sm>
      Link: https://lore.kernel.org/r/20210714061807.5737-1-martin.kepplinger@puri.sm
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      703527bf
    • Minas Harutyunyan's avatar
      usb: dwc2: gadget: Fix sending zero length packet in DDMA mode. · 61c12921
      Minas Harutyunyan authored
      commit d53dc388 upstream.
      
      Sending zero length packet in DDMA mode perform by DMA descriptor
      by setting SP (short packet) flag.
      
      For DDMA in function dwc2_hsotg_complete_in() does not need to send
      zlp.
      
      Tested by USBCV MSC tests.
      
      Fixes: f71b5e25
      
       ("usb: dwc2: gadget: fix zero length packet transfers")
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: default avatarMinas Harutyunyan <Minas.Harutyunyan@synopsys.com>
      Link: https://lore.kernel.org/r/967bad78c55dd2db1c19714eee3d0a17cf99d74a.1626777738.git.Minas.Harutyunyan@synopsys.com
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      61c12921
    • Minas Harutyunyan's avatar
      usb: dwc2: gadget: Fix GOUTNAK flow for Slave mode. · bd062872
      Minas Harutyunyan authored
      commit fecb3a17 upstream.
      
      Because of dwc2_hsotg_ep_stop_xfr() function uses poll
      mode, first need to mask GINTSTS_GOUTNAKEFF interrupt.
      In Slave mode GINTSTS_GOUTNAKEFF interrupt will be
      aserted only after pop OUT NAK status packet from RxFIFO.
      
      In dwc2_hsotg_ep_sethalt() function before setting
      DCTL_SGOUTNAK need to unmask GOUTNAKEFF interrupt.
      
      Tested by USBCV CH9 and MSC tests set in Slave, BDMA and DDMA.
      All tests are passed.
      
      Fixes: a4f82771 ("usb: dwc2: gadget: Disable enabled HW endpoint in dwc2_hsotg_ep_disable")
      Fixes: 6070636c
      
       ("usb: dwc2: Fix Stalling a Non-Isochronous OUT EP")
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: default avatarMinas Harutyunyan <Minas.Harutyunyan@synopsys.com>
      Link: https://lore.kernel.org/r/e17fad802bbcaf879e1ed6745030993abb93baf8.1626152924.git.Minas.Harutyunyan@synopsys.com
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bd062872
    • Marek Szyprowski's avatar
      usb: dwc2: Skip clock gating on Samsung SoCs · 36b53430
      Marek Szyprowski authored
      commit c4a0f7a6 upstream.
      
      Commit 0112b7ce ("usb: dwc2: Update dwc2_handle_usb_suspend_intr
      function.") changed the way the driver handles power down modes in a such
      way that it uses clock gating when no other power down mode is available.
      
      This however doesn't work well on the DWC2 implementation used on the
      Samsung SoCs. When a clock gating is enabled, system hangs. It looks that
      the proper clock gating requires some additional glue code in the shared
      USB2 PHY and/or Samsung glue code for the DWC2. To restore driver
      operation on the Samsung SoCs simply skip enabling clock gating mode
      until one finds what is really needed to make it working reliably.
      
      Fixes: 0112b7ce
      
       ("usb: dwc2: Update dwc2_handle_usb_suspend_intr function.")
      Cc: stable <stable@vger.kernel.org>
      Acked-by: default avatarKrzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
      Signed-off-by: default avatarMarek Szyprowski <m.szyprowski@samsung.com>
      Link: https://lore.kernel.org/r/20210716050127.4406-1-m.szyprowski@samsung.com
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      36b53430
    • Zhang Qilong's avatar
      usb: gadget: Fix Unbalanced pm_runtime_enable in tegra_xudc_probe · b85e8638
      Zhang Qilong authored
      commit 5b012481 upstream.
      
      Add missing pm_runtime_disable() when probe error out. It could
      avoid pm_runtime implementation complains when removing and probing
      again the driver.
      
      Fixes: 49db4272
      
       ("usb: gadget: Add UDC driver for tegra XUSB device mode controller")
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: default avatarZhang Qilong <zhangqilong3@huawei.com>
      Link: https://lore.kernel.org/r/20210618141441.107817-1-zhangqilong3@huawei.com
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b85e8638
    • John Keeping's avatar
      USB: serial: cp210x: add ID for CEL EM3588 USB ZigBee stick · 7138b108
      John Keeping authored
      commit d6a206e6
      
       upstream.
      
      Add the USB serial device ID for the CEL ZigBee EM3588 radio stick.
      
      Signed-off-by: default avatarJohn Keeping <john@metanate.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJohan Hovold <johan@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7138b108
    • Ian Ray's avatar
      USB: serial: cp210x: fix comments for GE CS1000 · f1a01c2b
      Ian Ray authored
      commit e9db418d upstream.
      
      Fix comments for GE CS1000 CP210x USB ID assignments.
      
      Fixes: 42213a01
      
       ("USB: serial: cp210x: add some more GE USB IDs")
      Signed-off-by: default avatarIan Ray <ian.ray@ge.com>
      Signed-off-by: default avatarSebastian Reichel <sebastian.reichel@collabora.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJohan Hovold <johan@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f1a01c2b
    • Marco De Marco's avatar
      USB: serial: option: add support for u-blox LARA-R6 family · 8a55cb17
      Marco De Marco authored
      commit 94b619a0
      
       upstream.
      
      The patch is meant to support LARA-R6 Cat 1 module family.
      
      Module USB ID:
      Vendor  ID: 0x05c6
      Product ID: 0x90fA
      
      Interface layout:
      If 0: Diagnostic
      If 1: AT parser
      If 2: AT parser
      If 3: QMI wwan (not available in all versions)
      
      Signed-off-by: default avatarMarco De Marco <marco.demarco@posteo.net>
      Link: https://lore.kernel.org/r/49260184.kfMIbaSn9k@mars
      
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJohan Hovold <johan@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8a55cb17
    • Yoshihiro Shimoda's avatar
      usb: renesas_usbhs: Fix superfluous irqs happen after usb_pkt_pop() · c9d143a3
      Yoshihiro Shimoda authored
      commit 5719df24 upstream.
      
      This driver has a potential issue which this driver is possible to
      cause superfluous irqs after usb_pkt_pop() is called. So, after
      the commit 3af32605 ("usb: renesas_usbhs: fix error return
      code of usbhsf_pkt_handler()") had been applied, we could observe
      the following error happened when we used g_audio.
      
          renesas_usbhs e6590000.usb: irq_ready run_error 1 : -22
      
      To fix the issue, disable the tx or rx interrupt in usb_pkt_pop().
      
      Fixes: 2743e7f9
      
       ("usb: renesas_usbhs: fix the usb_pkt_pop()")
      Cc: <stable@vger.kernel.org> # v4.4+
      Signed-off-by: default avatarYoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
      Link: https://lore.kernel.org/r/20210624122039.596528-1-yoshihiro.shimoda.uh@renesas.com
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c9d143a3
    • Mark Tomlinson's avatar
      usb: max-3421: Prevent corruption of freed memory · d4179cdb
      Mark Tomlinson authored
      commit b5fdf5c6 upstream.
      
      The MAX-3421 USB driver remembers the state of the USB toggles for a
      device/endpoint. To save SPI writes, this was only done when a new
      device/endpoint was being used. Unfortunately, if the old device was
      removed, this would cause writes to freed memory.
      
      To fix this, a simpler scheme is used. The toggles are read from
      hardware when a URB is completed, and the toggles are always written to
      hardware when any URB transaction is started. This will cause a few more
      SPI transactions, but no causes kernel panics.
      
      Fixes: 2d53139f
      
       ("Add support for using a MAX3421E chip as a host driver.")
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: default avatarMark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
      Link: https://lore.kernel.org/r/20210625031456.8632-1-mark.tomlinson@alliedtelesis.co.nz
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d4179cdb
    • Julian Sikorski's avatar
      USB: usb-storage: Add LaCie Rugged USB3-FW to IGNORE_UAS · 3b5d8c72
      Julian Sikorski authored
      commit 6abf2fe6
      
       upstream.
      
      LaCie Rugged USB3-FW appears to be incompatible with UAS. It generates
      errors like:
      [ 1151.582598] sd 14:0:0:0: tag#16 uas_eh_abort_handler 0 uas-tag 1 inflight: IN
      [ 1151.582602] sd 14:0:0:0: tag#16 CDB: Report supported operation codes a3 0c 01 12 00 00 00 00 02 00 00 00
      [ 1151.588594] scsi host14: uas_eh_device_reset_handler start
      [ 1151.710482] usb 2-4: reset SuperSpeed Gen 1 USB device number 2 using xhci_hcd
      [ 1151.741398] scsi host14: uas_eh_device_reset_handler success
      [ 1181.785534] scsi host14: uas_eh_device_reset_handler start
      
      Signed-off-by: default avatarJulian Sikorski <belegdol+github@gmail.com>
      Cc: stable <stable@vger.kernel.org>
      Link: https://lore.kernel.org/r/20210720171910.36497-1-belegdol+github@gmail.com
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3b5d8c72
    • Mathias Nyman's avatar
      usb: hub: Fix link power management max exit latency (MEL) calculations · 9499b2d2
      Mathias Nyman authored
      commit 1bf2761c
      
       upstream.
      
      Maximum Exit Latency (MEL) value is used by host to know how much in
      advance it needs to start waking up a U1/U2 suspended link in order to
      service a periodic transfer in time.
      
      Current MEL calculation only includes the time to wake up the path from
      U1/U2 to U0. This is called tMEL1 in USB 3.1 section C 1.5.2
      
      Total MEL = tMEL1 + tMEL2 +tMEL3 + tMEL4 which should additinally include:
      - tMEL2 which is the time it takes for PING message to reach device
      - tMEL3 time for device to process the PING and submit a PING_RESPONSE
      - tMEL4 time for PING_RESPONSE to traverse back upstream to host.
      
      Add the missing tMEL2, tMEL3 and tMEL4 to MEL calculation.
      
      Cc: <stable@kernel.org> # v3.5
      Signed-off-by: default avatarMathias Nyman <mathias.nyman@linux.intel.com>
      Link: https://lore.kernel.org/r/20210715150122.1995966-1-mathias.nyman@linux.intel.com
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9499b2d2
    • Mathias Nyman's avatar
      usb: hub: Disable USB 3 device initiated lpm if exit latency is too high · c7affd5b
      Mathias Nyman authored
      commit 1b7f56fb
      
       upstream.
      
      The device initiated link power management U1/U2 states should not be
      enabled in case the system exit latency plus one bus interval (125us) is
      greater than the shortest service interval of any periodic endpoint.
      
      This is the case for both U1 and U2 sytstem exit latencies and link states.
      
      See USB 3.2 section 9.4.9 "Set Feature" for more details
      
      Note, before this patch the host and device initiated U1/U2 lpm states
      were both enabled with lpm. After this patch it's possible to end up with
      only host inititated U1/U2 lpm in case the exit latencies won't allow
      device initiated lpm.
      
      If this case we still want to set the udev->usb3_lpm_ux_enabled flag so
      that sysfs users can see the link may go to U1/U2.
      
      Signed-off-by: default avatarMathias Nyman <mathias.nyman@linux.intel.com>
      Cc: stable <stable@vger.kernel.org>
      Link: https://lore.kernel.org/r/20210715150122.1995966-2-mathias.nyman@linux.intel.com
      
      
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c7affd5b
    • Nicholas Piggin's avatar
      KVM: PPC: Book3S HV Nested: Sanitise H_ENTER_NESTED TM state · 1408e47a
      Nicholas Piggin authored
      commit d9c57d3e upstream.
      
      The H_ENTER_NESTED hypercall is handled by the L0, and it is a request
      by the L1 to switch the context of the vCPU over to that of its L2
      guest, and return with an interrupt indication. The L1 is responsible
      for switching some registers to guest context, and the L0 switches
      others (including all the hypervisor privileged state).
      
      If the L2 MSR has TM active, then the L1 is responsible for
      recheckpointing the L2 TM state. Then the L1 exits to L0 via the
      H_ENTER_NESTED hcall, and the L0 saves the TM state as part of the exit,
      and then it recheckpoints the TM state as part of the nested entry and
      finally HRFIDs into the L2 with TM active MSR. Not efficient, but about
      the simplest approach for something that's horrendously complicated.
      
      Problems arise if the L1 exits to the L0 with a TM state which does not
      match the L2 TM state being requested. For example if the L1 is
      transactional but the L2 MSR is non-transactional, or vice versa. The
      L0's HRFID can take a TM Bad Thing interrupt and crash.
      
      Fix this by disallowing H_ENTER_NESTED in TM[T] state entirely, and then
      ensuring that if the L1 is suspended then the L2 must have TM active,
      and if the L1 is not suspended then the L2 must not have TM active.
      
      Fixes: 360cae31
      
       ("KVM: PPC: Book3S HV: Nested guest entry via hypercall")
      Cc: stable@vger.kernel.org # v4.20+
      Reported-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Acked-by: default avatarMichael Neuling <mikey@neuling.org>
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1408e47a
    • Nicholas Piggin's avatar
      KVM: PPC: Book3S: Fix H_RTAS rets buffer overflow · 35e114e6
      Nicholas Piggin authored
      commit f62f3c20 upstream.
      
      The kvmppc_rtas_hcall() sets the host rtas_args.rets pointer based on
      the rtas_args.nargs that was provided by the guest. That guest nargs
      value is not range checked, so the guest can cause the host rets pointer
      to be pointed outside the args array. The individual rtas function
      handlers check the nargs and nrets values to ensure they are correct,
      but if they are not, the handlers store a -3 (0xfffffffd) failure
      indication in rets[0] which corrupts host memory.
      
      Fix this by testing up front whether the guest supplied nargs and nret
      would exceed the array size, and fail the hcall directly without storing
      a failure indication to rets[0].
      
      Also expand on a comment about why we kill the guest and try not to
      return errors directly if we have a valid rets[0] pointer.
      
      Fixes: 8e591cb7
      
       ("KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls")
      Cc: stable@vger.kernel.org # v3.10+
      Reported-by: default avatarAlexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: default avatarMichael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      35e114e6
    • David Jeffery's avatar
      usb: ehci: Prevent missed ehci interrupts with edge-triggered MSI · 3d98808e
      David Jeffery authored
      commit 0b605572 upstream.
      
      When MSI is used by the ehci-hcd driver, it can cause lost interrupts which
      results in EHCI only continuing to work due to a polling fallback. But the
      reliance of polling drastically reduces performance of any I/O through EHCI.
      
      Interrupts are lost as the EHCI interrupt handler does not safely handle
      edge-triggered interrupts. It fails to ensure all interrupt status bits are
      cleared, which works with level-triggered interrupts but not the
      edge-triggered interrupts typical from using MSI.
      
      To fix this problem, check if the driver may have raced with the hardware
      setting additional interrupt status bits and clear status until it is in a
      stable state.
      
      Fixes: 306c54d0
      
       ("usb: hcd: Try MSI interrupts on PCI devices")
      Tested-by: default avatarLaurence Oberman <loberman@redhat.com>
      Reviewed-by: default avatarAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Acked-by: default avatarAlan Stern <stern@rowland.harvard.edu>
      Signed-off-by: default avatarDavid Jeffery <djeffery@redhat.com>
      Link: https://lore.kernel.org/r/20210715213744.GA44506@redhat
      
      
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3d98808e