  1. Jul 28, 2021
    • Stefan Wahren's avatar
      ARM: multi_v7_defconfig: Make NOP_USB_XCEIV driver built-in · 337deea6
      Stefan Wahren authored
      commit ab37a7a8 upstream.
      
      The usage of the usb-nop-xceiv PHY on Raspberry Pi boards with BCM283x has
      been a "regression source" many times. The latest case, breakage of
      USB mass storage boot, was caused by commit e5904747 ("driver core: Set
      fw_devlink=on by default") for multi_v7_defconfig. As long as
      NOP_USB_XCEIV is configured as a module, the dwc2 USB driver defers probing
      endlessly and prevents booting from a USB mass storage device. So make
      the driver built-in, as in bcm2835_defconfig and arm64/defconfig.
      
      Fixes: e5904747 ("driver core: Set fw_devlink=on by default")
      Reported-by: Ojaswin Mujoo <ojaswin98@gmail.com>
      Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
      Link: https://lore.kernel.org/r/1625915095-23077-1-git-send-email-stefan.wahren@i2se.com
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      337deea6
    • Paul Blakey's avatar
      skbuff: Release nfct refcount on napi stolen or re-used skbs · a5fd9d3d
      Paul Blakey authored
      commit 8550ff8d upstream.
      
      When multiple SKBs are merged to a new skb under napi GRO,
      or SKB is re-used by napi, if nfct was set for them in the
      driver, it will not be released while freeing their stolen
      head state or on re-use.
      
      Release nfct on napi's stolen or re-used SKBs, and
      in gro_list_prepare, check conntrack metadata diff.
      
      Fixes: 5c6b9460 ("net/mlx5e: CT: Handle misses after executing CT action")
      Reviewed-by: Roi Dayan <roid@nvidia.com>
      Signed-off-by: Paul Blakey <paulb@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a5fd9d3d
    • Matthieu Baerts's avatar
      mptcp: fix 'masking a bool' warning · 8f738d2d
      Matthieu Baerts authored
      commit c4512c63 upstream.
      
      Dan Carpenter reported an issue introduced in
      commit fde56eea ("mptcp: refine mptcp_cleanup_rbuf") where a new
      boolean (ack_pending) is masked with 0x9.
      
      The intention was not to ignore values by using a boolean. This
      variable should not have a 'bool' type: we should keep the 'u8' to allow
      this comparison.
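
A minimal sketch of why masking a bool goes wrong, written as plain userspace C rather than kernel code. The helper names and SKETCH_ACK_MASK are hypothetical; only the 0x9 mask value comes from the commit text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name; the value 0x9 is the mask mentioned in the commit text. */
#define SKETCH_ACK_MASK 0x09u

/* Flags kept in a u8: the mask selects exactly the intended bits. */
static unsigned int sketch_mask_u8(uint8_t pending)
{
	uint8_t ack_pending = pending;

	return ack_pending & SKETCH_ACK_MASK;
}

/* Flags stored in a bool: any non-zero value collapses to 1 before masking. */
static unsigned int sketch_mask_bool(uint8_t pending)
{
	bool ack_pending = pending;

	return ack_pending & SKETCH_ACK_MASK;
}
```

For a flag value of 0x02, which is outside the mask, the u8 variant yields 0 while the bool variant yields 1, so the bool version reports a match that is not there.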
      
      Fixes: fde56eea ("mptcp: refine mptcp_cleanup_rbuf")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8f738d2d
    • Mahesh Bandewar's avatar
      bonding: fix build issue · ecc9318d
      Mahesh Bandewar authored
      commit 5b69874f upstream.
      
      The commit 9a560550 (" bonding: Add struct bond_ipesc to manage SA") is causing the
      following build error when XFRM is not selected in the kernel config.
      
      lld: error: undefined symbol: xfrm_dev_state_flush
      >>> referenced by bond_main.c:3453 (drivers/net/bonding/bond_main.c:3453)
      >>>               net/bonding/bond_main.o:(bond_netdev_event) in archive drivers/built-in.a
      
      Fixes: 9a560550 (" bonding: Add struct bond_ipesc to manage SA")
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      CC: Taehee Yoo <ap420073@gmail.com>
      CC: Jay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ecc9318d
    • Yoshitaka Ikeda's avatar
      da510a38
    • Likun Gao's avatar
      drm/amdgpu: update golden setting for sienna_cichlid · bc93e990
      Likun Gao authored
      commit 3e94b596 upstream.
      
      Update GFX golden setting for sienna_cichlid.
      
      Signed-off-by: Likun Gao <Likun.Gao@amd.com>
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bc93e990
    • Xiaojian Du's avatar
      drm/amdgpu: update the golden setting for vangogh · 52ee22ce
      Xiaojian Du authored
      commit 4fff6fbc upstream.
      
      This patch is to update the golden setting for vangogh.
      
      Signed-off-by: Xiaojian Du <Xiaojian.Du@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      52ee22ce
    • Tao Zhou's avatar
      drm/amdgpu: update gc golden setting for dimgrey_cavefish · 72097f7b
      Tao Zhou authored
      commit cfe4e8f0 upstream.
      
      Update gc_10_3_4 golden setting.
      
      Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
      Reviewed-by: Guchun Chen <guchun.chen@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      72097f7b
    • Charles Baylis's avatar
      drm: Return -ENOTTY for non-drm ioctls · 75ab00b8
      Charles Baylis authored
      commit 3abab27c upstream.
      
      drm: Return -ENOTTY for non-drm ioctls
      
      Return -ENOTTY from drm_ioctl() when userspace passes in a cmd number
      which doesn't relate to the drm subsystem.
      
      Glibc uses the TCGETS ioctl to implement isatty(), and without this
      change isatty() incorrectly returns true for drm devices.
      
      To test run this command:
      $ if [ -t 0 ]; then echo is a tty; fi < /dev/dri/card0
      which shows "is a tty" without this patch.
      
      This may also modify memory which the userspace application is not
      expecting.
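
The kind of check described above can be sketched in userspace C as follows. This is an illustration, not the kernel's drm_ioctl() code: the SKETCH_* names and the helper are hypothetical, and the macro assumes the generic Linux ioctl number layout in which bits 8..15 of the command hold the "type" character ('d' for drm, 'T' for terminal ioctls such as TCGETS).

```c
#include <errno.h>

/* Assumption: generic ioctl encoding, type character in bits 8..15. */
#define SKETCH_IOC_TYPE(cmd)	(((cmd) >> 8) & 0xffu)
/* Assumption: the drm subsystem's ioctl type character is 'd'. */
#define SKETCH_DRM_IOCTL_BASE	'd'

/* Reject any command whose type does not belong to the drm subsystem. */
static int sketch_drm_ioctl_precheck(unsigned int cmd)
{
	if (SKETCH_IOC_TYPE(cmd) != SKETCH_DRM_IOCTL_BASE)
		return -ENOTTY;

	return 0;
}
```

With this layout, a terminal command such as 0x5401 (the usual TCGETS value) carries type 'T' and is rejected with -ENOTTY, which is the error isatty() relies on to report "not a tty".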
      
      Signed-off-by: Charles Baylis <cb-kernel@fishzet.co.uk>
      Cc: stable@vger.kernel.org
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Link: https://patchwork.freedesktop.org/patch/msgid/YPG3IBlzaMhfPqCr@stando.fishzet.co.uk
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      75ab00b8
    • Adrian Hunter's avatar
      driver core: Prevent warning when removing a device link from unregistered consumer · c9d31f7d
      Adrian Hunter authored
      commit e64daad6 upstream.
      
      sysfs_remove_link() causes a warning if the parent directory does not
      exist. That can happen if the device link consumer has not been registered.
      So do not attempt sysfs_remove_link() in that case.
      
      Fixes: 287905e6 ("driver core: Expose device link details in sysfs")
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Cc: stable@vger.kernel.org # 5.9+
      Reviewed-by: Rafael J. Wysocki <rafael@kernel.org>
      Link: https://lore.kernel.org/r/20210716114408.17320-2-adrian.hunter@intel.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c9d31f7d
    • Greg Kroah-Hartman's avatar
      nds32: fix up stack guard gap · 9d06d3d2
      Greg Kroah-Hartman authored
      commit c453db6c upstream.
      
      Commit 1be7107f ("mm: larger stack guard gap, between vmas") fixed
      up all architectures to deal with the stack guard gap.  But when nds32
      was added to the tree, it forgot to do the same thing.
      
      Resolve this by properly fixing up nds32's version of
      arch_get_unmapped_area().
      
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Qiang Liu <cyruscyliu@gmail.com>
      Cc: stable <stable@vger.kernel.org>
      Reported-by: iLifetruth <yixiaonn@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Link: https://lore.kernel.org/r/20210629104024.2293615-1-gregkh@linuxfoundation.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9d06d3d2
    • Jérôme Glisse's avatar
      misc: eeprom: at24: Always append device id even if label property is set. · 7544d21b
      Jérôme Glisse authored
      commit c36748ac upstream.
      
      We need to append the device id even if the eeprom has a label property set, as some
      platforms can have multiple eeproms with the same label and we cannot register
      each of those with the same label. Failing to register those eeproms triggers
      cascading failures on such platforms (the system no longer works).
      
      This fixes a regression on such platforms introduced with 4e302c3b.
      
      Reported-by: Alexander Fomichev <fomichev.ru@gmail.com>
      Fixes: 4e302c3b ("misc: eeprom: at24: fix NVMEM name with custom AT24 device name")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7544d21b
    • Ilya Dryomov's avatar
      rbd: always kick acquire on "acquired" and "released" notifications · 6ef92931
      Ilya Dryomov authored
      commit 8798d070 upstream.
      
      Skipping the "lock has been released" notification if the lock owner
      is not what we expect based on owner_cid can lead to I/O hangs.
      One example is our own notifications: because owner_cid is cleared
      in rbd_unlock(), when we get our own notification it is processed as
      unexpected/duplicate and maybe_kick_acquire() isn't called.  If a peer
      that requested the lock then doesn't go through with acquiring it,
      I/O requests that came in while the lock was being quiesced would
      be stalled until another I/O request is submitted and kicks acquire
      from rbd_img_exclusive_lock().
      
      This makes the comment in rbd_release_lock() actually true: prior to
      this change the canceled work was being requeued in response to the
      "lock has been acquired" notification from rbd_handle_acquired_lock().
      
      Cc: stable@vger.kernel.org # 5.3+
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Tested-by: Robin Geuze <robin.geuze@nl.team.blue>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6ef92931
    • Ilya Dryomov's avatar
      rbd: don't hold lock_rwsem while running_list is being drained · 8b334d74
      Ilya Dryomov authored
      commit ed9eb710 upstream.
      
      Currently rbd_quiesce_lock() holds lock_rwsem for read while blocking
      on releasing_wait completion.  On the I/O completion side, each image
      request also needs to take lock_rwsem for read.  Because rw_semaphore
      implementation doesn't allow new readers after a writer has indicated
      interest in the lock, this can result in a deadlock if something that
      needs to take lock_rwsem for write gets involved.  For example:
      
      1. watch error occurs
      2. rbd_watch_errcb() takes lock_rwsem for write, clears owner_cid and
         releases lock_rwsem
      3. after reestablishing the watch, rbd_reregister_watch() takes
         lock_rwsem for write and calls rbd_reacquire_lock()
      4. rbd_quiesce_lock() downgrades lock_rwsem to read and blocks on
         releasing_wait until running_list becomes empty
      5. another watch error occurs
      6. rbd_watch_errcb() blocks trying to take lock_rwsem for write
      7. no in-flight image request can complete and delete itself from
         running_list because lock_rwsem won't be granted anymore
      
      A similar scenario can occur with "lock has been acquired" and "lock
      has been released" notification handlers which also take lock_rwsem for
      write to update owner_cid.
      
      We don't actually get anything useful from sitting on lock_rwsem in
      rbd_quiesce_lock() -- owner_cid updates certainly don't need to be
      synchronized with.  In fact the whole owner_cid tracking logic could
      probably be removed from the kernel client because we don't support
      proxied maintenance operations.
      
      Cc: stable@vger.kernel.org # 5.3+
      URL: https://tracker.ceph.com/issues/42757
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Tested-by: Robin Geuze <robin.geuze@nl.team.blue>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8b334d74
    • Mike Kravetz's avatar
      hugetlbfs: fix mount mode command line processing · 79da14fa
      Mike Kravetz authored
      commit e0f7e2b2 upstream.
      
      In commit 32021982 ("hugetlbfs: Convert to fs_context") processing
      of the mount mode string was changed from match_octal() to fsparam_u32.
      
      This changed existing behavior as match_octal does not require octal
      values to have a '0' prefix, but fsparam_u32 does.
      
      Use fsparam_u32oct which provides the same behavior as match_octal.
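
The behavior difference can be illustrated in userspace with strtoul(), assuming that fsparam_u32 acts like a parse with automatic base detection (base 0) and that match_octal() and fsparam_u32oct act like a parse with base 8. The helper names below are hypothetical and are not kernel APIs.

```c
#include <stdlib.h>

/*
 * Base 0: the prefix decides the base, so "755" is read as decimal and
 * only "0755" is read as octal (assumed to mirror fsparam_u32).
 */
static unsigned long sketch_parse_auto(const char *s)
{
	return strtoul(s, NULL, 0);
}

/*
 * Base 8: the string is always octal, with or without a leading '0'
 * (assumed to mirror match_octal() and fsparam_u32oct).
 */
static unsigned long sketch_parse_octal(const char *s)
{
	return strtoul(s, NULL, 8);
}
```

Under these assumptions a mount option such as mode=755 would be read as decimal 755 (octal 1363) by the base-0 parser, while the octal parser yields the intended 0755 for both "755" and "0755".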
      
      Link: https://lkml.kernel.org/r/20210721183326.102716-1-mike.kravetz@oracle.com
      Fixes: 32021982 ("hugetlbfs: Convert to fs_context")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Dennis Camera <bugs+kernel.org@dtnr.ch>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      79da14fa
    • Qi Zheng's avatar
      mm: fix the deadlock in finish_fault() · 4861f6d3
      Qi Zheng authored
      commit e4dc3489 upstream.
      
      Commit 63f3655f ("mm, memcg: fix reclaim deadlock with writeback")
      fixed the following ABBA deadlock by pre-allocating the pte page table
      without holding the page lock.
      
                                              lock_page(A)
                                              SetPageWriteback(A)
                                              unlock_page(A)
        lock_page(B)
                                              lock_page(B)
        pte_alloc_one
          shrink_page_list
            wait_on_page_writeback(A)
                                              SetPageWriteback(B)
                                              unlock_page(B)
      
                                              # flush A, B to clear the writeback
      
      Commit f9ce0be7 ("mm: Cleanup faultaround and finish_fault()
      codepaths") reworked the relevant code but ignored this race.  This will
      cause the deadlock above to appear again, so fix it.
      
      Link: https://lkml.kernel.org/r/20210721074849.57004-1-zhengqi.arch@bytedance.com
      Fixes: f9ce0be7 ("mm: Cleanup faultaround and finish_fault() codepaths")
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4861f6d3
    • Mike Rapoport's avatar
      memblock: make for_each_mem_range() traverse MEMBLOCK_HOTPLUG regions · 5d4b4d2e
      Mike Rapoport authored
      commit 79e482e9 upstream.
      
      Commit b10d6bca ("arch, drivers: replace for_each_membock() with
      for_each_mem_range()") didn't take into account that when there is
      movable_node parameter in the kernel command line, for_each_mem_range()
      would skip ranges marked with MEMBLOCK_HOTPLUG.
      
      The page table setup code in POWER uses for_each_mem_range() to create
      the linear mapping of the physical memory and since the regions marked
      as MEMBLOCK_HOTPLUG are skipped, they never make it to the linear map.
      
      A later access to the memory in those ranges will fail:
      
        BUG: Unable to handle kernel data access on write at 0xc000000400000000
        Faulting instruction address: 0xc00000000008a3c0
        Oops: Kernel access of bad area, sig: 11 [#1]
        LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in:
        CPU: 0 PID: 53 Comm: kworker/u2:0 Not tainted 5.13.0 #7
        NIP:  c00000000008a3c0 LR: c0000000003c1ed8 CTR: 0000000000000040
        REGS: c000000008a57770 TRAP: 0300   Not tainted  (5.13.0)
        MSR:  8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 84222202  XER: 20040000
        CFAR: c0000000003c1ed4 DAR: c000000400000000 DSISR: 42000000 IRQMASK: 0
        GPR00: c0000000003c1ed8 c000000008a57a10 c0000000019da700 c000000400000000
        GPR04: 0000000000000280 0000000000000180 0000000000000400 0000000000000200
        GPR08: 0000000000000100 0000000000000080 0000000000000040 0000000000000300
        GPR12: 0000000000000380 c000000001bc0000 c0000000001660c8 c000000006337e00
        GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
        GPR20: 0000000040000000 0000000020000000 c000000001a81990 c000000008c30000
        GPR24: c000000008c20000 c000000001a81998 000fffffffff0000 c000000001a819a0
        GPR28: c000000001a81908 c00c000001000000 c000000008c40000 c000000008a64680
        NIP clear_user_page+0x50/0x80
        LR __handle_mm_fault+0xc88/0x1910
        Call Trace:
          __handle_mm_fault+0xc44/0x1910 (unreliable)
          handle_mm_fault+0x130/0x2a0
          __get_user_pages+0x248/0x610
          __get_user_pages_remote+0x12c/0x3e0
          get_arg_page+0x54/0xf0
          copy_string_kernel+0x11c/0x210
          kernel_execve+0x16c/0x220
          call_usermodehelper_exec_async+0x1b0/0x2f0
          ret_from_kernel_thread+0x5c/0x70
        Instruction dump:
        79280fa4 79271764 79261f24 794ae8e2 7ca94214 7d683a14 7c893a14 7d893050
        7d4903a6 60000000 60000000 60000000 <7c001fec> 7c091fec 7c081fec 7c051fec
        ---[ end trace 490b8c67e6075e09 ]---
      
      Making for_each_mem_range() include MEMBLOCK_HOTPLUG regions in the
      traversal fixes this issue.
      
      Link: https://bugzilla.redhat.com/show_bug.cgi?id=1976100
      Link: https://lkml.kernel.org/r/20210712071132.20902-1-rppt@kernel.org
      Fixes: b10d6bca ("arch, drivers: replace for_each_membock() with for_each_mem_range()")
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Tested-by: Greg Kurz <groug@kaod.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5d4b4d2e
    • Sergei Trofimovich's avatar
      mm: page_alloc: fix page_poison=1 / INIT_ON_ALLOC_DEFAULT_ON interaction · 0e88a5be
      Sergei Trofimovich authored
      commit 69e5d322 upstream.
      
      To reproduce the failure we need the following system:
      
       - kernel command: page_poison=1 init_on_free=0 init_on_alloc=0
      
       - kernel config:
          * CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y
          * CONFIG_INIT_ON_FREE_DEFAULT_ON=y
          * CONFIG_PAGE_POISONING=y
      
      Resulting in:
      
          0000000085629bdd: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          0000000022861832: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00000000c597f5b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          CPU: 11 PID: 15195 Comm: bash Kdump: loaded Tainted: G     U     O      5.13.1-gentoo-x86_64 #1
          Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 2801 01/13/2021
          Call Trace:
           dump_stack+0x64/0x7c
           __kernel_unpoison_pages.cold+0x48/0x84
           post_alloc_hook+0x60/0xa0
           get_page_from_freelist+0xdb8/0x1000
           __alloc_pages+0x163/0x2b0
           __get_free_pages+0xc/0x30
           pgd_alloc+0x2e/0x1a0
           mm_init+0x185/0x270
           dup_mm+0x6b/0x4f0
           copy_process+0x190d/0x1b10
           kernel_clone+0xba/0x3b0
           __do_sys_clone+0x8f/0xb0
           do_syscall_64+0x68/0x80
           entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Before commit 51cba1eb ("init_on_alloc: Optimize static branches")
      init_on_alloc never enabled the static branch by default.  It could only be
      enabled explicitly by init_mem_debugging_and_hardening().
      
      But after commit 51cba1eb, a static branch could already be enabled
      by default.  There was no code to ever disable it.  That caused the
      page_poison=1 / init_on_free=1 conflict.
      
      This change extends init_mem_debugging_and_hardening() to also disable
      the static branches.
      
      Link: https://lkml.kernel.org/r/20210714031935.4094114-1-keescook@chromium.org
      Link: https://lore.kernel.org/r/20210712215816.1512739-1-slyfox@gentoo.org
      Fixes: 51cba1eb ("init_on_alloc: Optimize static branches")
      Signed-off-by: Sergei Trofimovich <slyfox@gentoo.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Co-developed-by: Kees Cook <keescook@chromium.org>
      Reported-by: Mikhail Morfikov <mmorfikov@gmail.com>
      Reported-by: <bowsingbetee@pm.me>
      Tested-by: <bowsingbetee@protonmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0e88a5be
    • Christoph Hellwig's avatar
      mm: call flush_dcache_page() in memcpy_to_page() and memzero_page() · ee791f0b
      Christoph Hellwig authored
      commit 8dad53a1 upstream.
      
      memcpy_to_page and memzero_page can write to arbitrary pages, which
      could be in the page cache or in high memory, so call
      flush_dcache_page() to flush the dcache.
      
      This is a problem when using these helpers on dcache challenged
      architectures.  Right now there are just a few users, chances are no one
      used the PC floppy driver, the aha1542 driver for an ISA SCSI HBA, and a
      few advanced and optional btrfs and ext4 features on those platforms yet
      since the conversion.
      
      Link: https://lkml.kernel.org/r/20210713055231.137602-2-hch@lst.de
      Fixes: bb90d4bc ("mm/highmem: Lift memcpy_[to|from]_page to core")
      Fixes: 28961998 ("iov_iter: lift memzero_page() to highmem.h")
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ee791f0b
    • Alexander Potapenko's avatar
      kfence: skip all GFP_ZONEMASK allocations · 5040926b
      Alexander Potapenko authored
      commit 236e9f15 upstream.
      
      Allocation requests outside ZONE_NORMAL (MOVABLE, HIGHMEM or DMA) cannot
      be fulfilled by KFENCE, because KFENCE memory pool is located in a zone
      different from the requested one.
      
      Because callers of kmem_cache_alloc() may actually rely on the
      allocation to reside in the requested zone (e.g.  memory allocations
      done with __GFP_DMA must be DMAable), skip all allocations done with
      GFP_ZONEMASK and/or respective SLAB flags (SLAB_CACHE_DMA and
      SLAB_CACHE_DMA32).
      
      Link: https://lkml.kernel.org/r/20210714092222.1890268-2-glider@google.com
      Fixes: 0ce20dd8 ("mm: add Kernel Electric-Fence infrastructure")
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5040926b
    • Alexander Potapenko's avatar
      kfence: move the size check to the beginning of __kfence_alloc() · e9adaed2
      Alexander Potapenko authored
      commit 235a85cb upstream.
      
      Check the allocation size before toggling kfence_allocation_gate.
      
      This way allocations that can't be served by KFENCE will not result in
      waiting for another CONFIG_KFENCE_SAMPLE_INTERVAL without allocating
      anything.
      
      Link: https://lkml.kernel.org/r/20210714092222.1890268-1-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Suggested-by: Marco Elver <elver@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>	[5.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e9adaed2
    • Peter Collingbourne's avatar
      userfaultfd: do not untag user pointers · 60e7f63d
      Peter Collingbourne authored
      commit e71e2ace upstream.
      
      Patch series "userfaultfd: do not untag user pointers", v5.
      
      If a user program uses userfaultfd on ranges of heap memory, it may end
      up passing a tagged pointer to the kernel in the range.start field of
      the UFFDIO_REGISTER ioctl.  This can happen when using an MTE-capable
      allocator, or on Android if using the Tagged Pointers feature for MTE
      readiness [1].
      
      When a fault subsequently occurs, the tag is stripped from the fault
      address returned to the application in the fault.address field of struct
      uffd_msg.  However, from the application's perspective, the tagged
      address *is* the memory address, so if the application is unaware of
      memory tags, it may get confused by receiving an address that is, from
      its point of view, outside of the bounds of the allocation.  We observed
      this behavior in the kselftest for userfaultfd [2] but other
      applications could have the same problem.
      
      Address this by not untagging pointers passed to the userfaultfd ioctls.
      Instead, let the system call fail.  Also change the kselftest to use
      mmap so that it doesn't encounter this problem.
      
      [1] https://source.android.com/devices/tech/debug/tagged-pointers
      [2] tools/testing/selftests/vm/userfaultfd.c
      
      This patch (of 2):
      
      Do not untag pointers passed to the userfaultfd ioctls.  Instead, let
      the system call fail.  This will provide an early indication of problems
      with tag-unaware userspace code instead of letting the code get confused
      later, and is consistent with how we decided to handle brk/mmap/mremap
      in commit dcde2373 ("mm: Avoid creating virtual address aliases in
      brk()/mmap()/mremap()"), as well as being consistent with the existing
      tagged address ABI documentation relating to how ioctl arguments are
      handled.
      
      The code change is a revert of commit 7d032574 ("userfaultfd: untag
      user pointers") plus some fixups to some additional calls to
      validate_range that have appeared since then.
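
A rough userspace sketch of why a tagged pointer now makes the call fail: once the address is no longer untagged, a range check against the user address space limit sees the tag bits in the top byte and rejects the range. Everything here is an assumption for illustration (the helper name, a 4 KiB page size, and a 48-bit user address space limit); it is not the kernel's validate_range().

```c
#include <errno.h>
#include <stdint.h>

/* Assumptions for illustration only. */
#define SKETCH_PAGE_SIZE	4096ULL
#define SKETCH_TASK_SIZE	(1ULL << 48)

/* Accept only page-aligned, non-empty ranges below the address limit. */
static int sketch_validate_range(uint64_t start, uint64_t len)
{
	if (start & (SKETCH_PAGE_SIZE - 1))
		return -EINVAL;
	if (!len || (len & (SKETCH_PAGE_SIZE - 1)))
		return -EINVAL;
	/* A tag in bits 56..63 pushes start above the limit. */
	if (start >= SKETCH_TASK_SIZE)
		return -EINVAL;
	if (len > SKETCH_TASK_SIZE - start)
		return -EINVAL;

	return 0;
}
```

In this sketch the untagged address 0x00007f0000000000 passes, while the same address with a tag in the top byte, 0x0b007f0000000000, is rejected with -EINVAL.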
      
      [1] https://source.android.com/devices/tech/debug/tagged-pointers
      [2] tools/testing/selftests/vm/userfaultfd.c
      
      Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
      Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
      Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
      Fixes: 63f0c603 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
      Signed-off-by: Peter Collingbourne <pcc@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Delva <adelva@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Martin <Dave.Martin@arm.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mitch Phillips <mitchp@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: William McVicker <willmcvicker@google.com>
      Cc: <stable@vger.kernel.org>	[5.4]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      60e7f63d
    • Jens Axboe's avatar
      io_uring: fix early fdput() of file · a6ead781
      Jens Axboe authored
      commit 0cc936f7 upstream.
      
      A previous commit shuffled some code around, and inadvertently used
      struct file after fdput() had been called on it. As we can't touch
      the file post fdput() dropping our reference, move the fdput() to
      after that has been done.
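
      The ordering rule above can be sketched in user-space C. This is only
      an illustration: struct toy_file, toy_put() and check_then_put() are
      hypothetical stand-ins, not the io_uring code or the kernel's fdput().

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted file object. */
struct toy_file {
    int refs;
    bool is_ring;   /* some property we need to read from the file */
};

/* Drop one reference; the object is freed once the count reaches zero. */
static void toy_put(struct toy_file *f)
{
    if (--f->refs == 0)
        free(f);
}

/* Correct ordering: read everything we need first, then drop the reference. */
static bool check_then_put(struct toy_file *f)
{
    bool ok = f->is_ring;   /* last access to *f */

    toy_put(f);             /* f must not be touched after this point */
    return ok;
}
```

      Reading f->is_ring after toy_put() would be the toy equivalent of the
      bug described here, since the put may have released the last reference.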
      
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/io-uring/YPnqM0fY3nM5RdRI@zeniv-ca.linux.org.uk/
      Fixes: f2a48dd0 ("io_uring: refactor io_sq_offload_create()")
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a6ead781
    • io_uring: remove double poll entry on arm failure · 81cebade
      Pavel Begunkov authored
      commit 46fee9ab upstream.
      
      __io_queue_proc() can enqueue both poll entries and still fail
      afterwards, so the callers trying to cancel it should also try to remove
      the second poll entry (if any).
      
      For example, it may leave the request alive referencing an io_uring
      context but not accessible for cancellation:
      
      [  282.599913][ T1620] task:iou-sqp-23145   state:D stack:28720 pid:23155 ppid:  8844 flags:0x00004004
      [  282.609927][ T1620] Call Trace:
      [  282.613711][ T1620]  __schedule+0x93a/0x26f0
      [  282.634647][ T1620]  schedule+0xd3/0x270
      [  282.638874][ T1620]  io_uring_cancel_generic+0x54d/0x890
      [  282.660346][ T1620]  io_sq_thread+0xaac/0x1250
      [  282.696394][ T1620]  ret_from_fork+0x1f/0x30
      
      Cc: stable@vger.kernel.org
      Fixes: 18bceab1 ("io_uring: allow POLL_ADD with double poll_wait() users")
      Reported-and-tested-by: <syzbot+ac957324022b7132accf@syzkaller.appspotmail.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/0ec1228fc5eda4cb524eeda857da8efdc43c331c.1626774457.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      81cebade
    • io_uring: explicitly count entries for poll reqs · 0d80ae09
      Pavel Begunkov authored
      commit 68b11e8b upstream.
      
      If __io_queue_proc() fails to add a second poll entry, e.g. kmalloc()
      failed, but it goes on with a third waitqueue, it may succeed and
      overwrite the error status. Count the number of poll entries we added,
      so we can set pt->error to zero at the beginning and find out when the
      mentioned scenario happens.
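
      A minimal model of the difference, assuming the old scheme let every
      successful add clear the error (our reading of "overwrite the error
      status"). The names toy_poll_table, arm_old() and arm_counted() are
      hypothetical and this is not the io_uring code.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a poll table with an entry counter. */
struct toy_poll_table {
    int nr_entries;  /* entries successfully queued */
    int error;       /* sticky error from any failed attempt */
};

/* Old scheme (assumed): each success overwrites the status with 0. */
static int arm_old(const int *alloc_ok, size_t n)
{
    int error = -EINVAL;

    for (size_t i = 0; i < n; i++)
        error = alloc_ok[i] ? 0 : -ENOMEM;
    return error;
}

/* Counting scheme: start from 0, keep errors, count successes separately. */
static int arm_counted(const int *alloc_ok, size_t n)
{
    struct toy_poll_table pt = { .nr_entries = 0, .error = 0 };

    for (size_t i = 0; i < n; i++) {
        if (alloc_ok[i])
            pt.nr_entries++;
        else
            pt.error = -ENOMEM;
    }
    /* Nothing queued and nothing failed: no waitqueue was offered at all. */
    if (!pt.nr_entries && !pt.error)
        pt.error = -EINVAL;
    return pt.error;
}
```

      For the sequence success, failure, success, arm_old() ends with 0 and
      hides the failure, while arm_counted() still reports -ENOMEM.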
      
      Cc: stable@vger.kernel.org
      Fixes: 18bceab1 ("io_uring: allow POLL_ADD with double poll_wait() users")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/9d6b9e561f88bcc0163623b74a76c39f712151c3.1626774457.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0d80ae09
    • selftest: use mmap instead of posix_memalign to allocate memory · 2f13b6fe
      Peter Collingbourne authored
      commit 0db282ba upstream.
      
      This test passes pointers obtained from anon_allocate_area to the
      userfaultfd and mremap APIs.  This causes a problem if the system
      allocator returns tagged pointers because with the tagged address ABI
      the kernel rejects tagged addresses passed to these APIs, which would
      end up causing the test to fail.  To make this test compatible with such
      system allocators, stop using the system allocator to allocate memory in
      anon_allocate_area, and instead just use mmap.
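
      As a rough sketch of the approach (not the selftest's actual code), an
      anonymous private mapping can take the place of an allocator call. The
      helper names anon_area_alloc() and anon_area_free() are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Return a page-aligned, zero-filled anonymous mapping of `bytes` bytes,
 * or NULL on failure. The address comes straight from the kernel rather
 * than from the system allocator.
 */
static void *anon_area_alloc(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Release a mapping obtained from anon_area_alloc(); returns 0 on success. */
static int anon_area_free(void *p, size_t bytes)
{
    return munmap(p, bytes);
}
```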
      
      Link: https://lkml.kernel.org/r/20210714195437.118982-3-pcc@google.com
      Link: https://linux-review.googlesource.com/id/Icac91064fcd923f77a83e8e133f8631c5b8fc241
      Fixes: c47174fc ("userfaultfd: selftest")
      Co-developed-by: Lokesh Gidra <lokeshgidra@google.com>
      Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
      Signed-off-by: Peter Collingbourne <pcc@google.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Dave Martin <Dave.Martin@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Alistair Delva <adelva@google.com>
      Cc: William McVicker <willmcvicker@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Mitch Phillips <mitchp@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.4]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2f13b6fe
    • posix-cpu-timers: Fix rearm racing against process tick · fae0c4bb
      Frederic Weisbecker authored
      commit 1a3402d9 upstream.
      
      Since the process wide cputime counter is started locklessly from
      posix_cpu_timer_rearm(), it can be concurrently stopped by operations
      on other timers from the same thread group, such as in the following
      unlucky scenario:
      
               CPU 0                                CPU 1
               -----                                -----
                                                 timer_settime(TIMER B)
         posix_cpu_timer_rearm(TIMER A)
             cpu_clock_sample_group()
                 (pct->timers_active already true)
      
                                                 handle_posix_cpu_timers()
                                                     check_process_timers()
                                                         stop_process_timers()
                                                             pct->timers_active = false
             arm_timer(TIMER A)
      
         tick -> run_posix_cpu_timers()
             // sees !pct->timers_active, ignore
             // our TIMER A
      
      Fix this with simply locking process wide cputime counting start and
      timer arm in the same block.
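
      The idea can be modelled in user space with a pthread mutex. This is a
      simplified sketch under our own assumptions: struct toy_group and the
      two helpers are hypothetical, and the stop condition is much cruder
      than what the kernel checks.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of per-thread-group timer state. */
struct toy_group {
    pthread_mutex_t lock;
    bool timers_active;   /* is group-wide cputime counting running? */
    int armed;            /* number of armed timers */
};

/* Start counting and arm the timer inside one critical section. */
static void toy_rearm(struct toy_group *g)
{
    pthread_mutex_lock(&g->lock);
    g->timers_active = true;
    g->armed++;
    pthread_mutex_unlock(&g->lock);
}

/* Stop counting only if, under the same lock, no timer is armed. */
static void toy_stop_if_idle(struct toy_group *g)
{
    pthread_mutex_lock(&g->lock);
    if (g->armed == 0)
        g->timers_active = false;
    pthread_mutex_unlock(&g->lock);
}
```

      Because both steps of toy_rearm() happen under the lock, a concurrent
      toy_stop_if_idle() runs either before both or after both, so it cannot
      clear timers_active between the start and the arm.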
      
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Fixes: 60f2ceaa ("posix-cpu-timers: Remove unnecessary locking around cpu_clock_sample_group")
      Cc: stable@vger.kernel.org
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fae0c4bb
    • bus: mhi: pci_generic: Fix inbound IPCR channel · 52db60a9
      Loic Poulain authored
      commit b8a97f2a upstream.
      
      The qrtr-mhi client driver assumes that inbound buffers are
      automatically allocated and queued by the MHI core, but this
      doesn't happen for mhi pci devices since IPCR inbound channel is
      not flagged with auto_queue, causing unusable IPCR (qrtr)
      feature. Fix that.
      
      Link: https://lore.kernel.org/r/1625736749-24947-1-git-send-email-loic.poulain@linaro.org
      [mani: fixed a spelling mistake in commit description]
      Fixes: 855a70c1 ("bus: mhi: Add MHI PCI support for WWAN modems")
      Cc: stable@vger.kernel.org #5.10
      Reviewed-by: Hemant kumar <hemantk@codeaurora.org>
      Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
      Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
      Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Link: https://lore.kernel.org/r/20210716075106.49938-4-manivannan.sadhasivam@linaro.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      52db60a9
    • bus: mhi: core: Validate channel ID when processing command completions · aed4f5b5
      Bhaumik Bhatt authored
      commit 546362a9 upstream.
      
      MHI reads the channel ID from the event ring element sent by the
      device which can be any value between 0 and 255. In order to
      prevent any out of bound accesses, add a check against the maximum
      number of channels supported by the controller and those channels
      not configured yet so as to skip processing of that event ring
      element.
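
      The check described above might look roughly like the following. This
      is a sketch only: struct toy_cntrl, struct toy_chan and
      toy_chan_id_valid() are hypothetical names, not the MHI core's types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-channel state. */
struct toy_chan {
    bool configured;
};

/* Hypothetical controller: chans[] holds max_chan entries. */
struct toy_cntrl {
    uint32_t max_chan;
    struct toy_chan *chans;
};

/*
 * A channel ID taken from a device-written event is untrusted. Accept it
 * only if it indexes inside chans[] and the channel has been configured.
 * The bounds test comes first, so chans[] is never read out of range.
 */
static bool toy_chan_id_valid(const struct toy_cntrl *c, uint32_t chan)
{
    return chan < c->max_chan && c->chans[chan].configured;
}
```

      A caller would skip the event ring element whenever this returns false.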
      
      Link: https://lore.kernel.org/r/1624558141-11045-1-git-send-email-bbhatt@codeaurora.org
      Fixes: 1d3173a3 ("bus: mhi: core: Add support for processing events from client device")
      Cc: stable@vger.kernel.org #5.10
      Reviewed-by: Hemant Kumar <hemantk@codeaurora.org>
      Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
      Signed-off-by: Bhaumik Bhatt <bbhatt@codeaurora.org>
      Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Link: https://lore.kernel.org/r/20210716075106.49938-3-manivannan.sadhasivam@linaro.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      aed4f5b5
    • bus: mhi: pci_generic: Apply no-op for wake using sideband wake boolean · a8827068
      Bhaumik Bhatt authored
      commit 56f6f4c4 upstream.
      
      Devices such as SDX24 do not have the provision for inband wake
      doorbell in the form of channel 127 and instead have a sideband
      GPIO for it. Newer devices such as SDX55 or SDX65 support inband
      wake method by default. Ensure the functionality is used based on
      this such that device wake stays held when a client driver uses
      mhi_device_get() API or the equivalent debugfs entry.
      
      Link: https://lore.kernel.org/r/1624560809-30610-1-git-send-email-bbhatt@codeaurora.org
      Fixes: e3e5e650 ("bus: mhi: pci_generic: No-Op for device_wake operations")
      Cc: stable@vger.kernel.org #5.12
      Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Signed-off-by: Bhaumik Bhatt <bbhatt@codeaurora.org>
      Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
      Link: https://lore.kernel.org/r/20210716075106.49938-2-manivannan.sadhasivam@linaro.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a8827068
    • driver core: auxiliary bus: Fix memory leak when driver_register() fail · ce5b3de5
      Peter Ujfalusi authored
      commit 4afa0c22 upstream.
      
      If driver_register() returns with error we need to free the memory
      allocated for auxdrv->driver.name before returning from
      __auxiliary_driver_register()
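
      The error path can be sketched in user-space C as follows. All names
      are hypothetical (toy_driver, toy_core_register(),
      toy_driver_register()), and the "module.name" string format is an
      assumption made for the illustration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical driver object owning a heap-allocated name. */
struct toy_driver {
    char *name;
};

/* Stand-in for a core registration call that may fail. */
static int toy_core_register(const struct toy_driver *drv, int should_fail)
{
    (void)drv;
    return should_fail ? -1 : 0;
}

/* Build the name, register, and free the name again if registration fails. */
static int toy_driver_register(struct toy_driver *drv, const char *mod,
                               const char *name, int should_fail)
{
    size_t len = strlen(mod) + 1 + strlen(name) + 1;
    int ret;

    drv->name = malloc(len);
    if (!drv->name)
        return -1;
    snprintf(drv->name, len, "%s.%s", mod, name);

    ret = toy_core_register(drv, should_fail);
    if (ret) {
        free(drv->name);    /* without this, the name leaks on failure */
        drv->name = NULL;
    }
    return ret;
}
```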
      
      Fixes: 7de3697e ("Add auxiliary bus support")
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
      Link: https://lore.kernel.org/r/20210713093438.3173-1-peter.ujfalusi@linux.intel.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ce5b3de5
    • ixgbe: Fix packet corruption due to missing DMA sync · 423123e4
      Markus Boehme authored
      commit 09cfae9f upstream.
      
      When receiving a packet with multiple fragments, hardware may still
      touch the first fragment until the entire packet has been received. The
      driver therefore keeps the first fragment mapped for DMA until end of
      packet has been asserted, and delays its dma_sync call until then.
      
      The driver tries to fit multiple receive buffers on one page. When using
      3K receive buffers (e.g. using Jumbo frames and legacy-rx is turned
      off/build_skb is being used) on an architecture with 4K pages, the
      driver allocates an order 1 compound page and uses one page per receive
      buffer. To determine the correct offset for a delayed DMA sync of the
      first fragment of a multi-fragment packet, the driver then cannot just
      use PAGE_MASK on the DMA address but has to construct a mask based on
      the actual size of the backing page.
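
      The mask arithmetic can be illustrated with plain integers. This is a
      sketch assuming 4K base pages; TOY_PAGE_SIZE and
      offset_in_backing_page() are hypothetical names, not the driver's
      helpers.

```c
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u   /* assumed base page size */

/*
 * Offset of an address within its backing page, where the backing page is
 * a compound page of (TOY_PAGE_SIZE << order) bytes. order == 0 gives the
 * same result as masking with a plain 4K page mask.
 */
static uint64_t offset_in_backing_page(uint64_t addr, unsigned int order)
{
    uint64_t size = (uint64_t)TOY_PAGE_SIZE << order;

    return addr & (size - 1);
}
```

      For a buffer that starts 64 bytes into the second 4K page of an order 1
      compound page, the order 0 mask yields offset 64, which points into the
      first page, while the order 1 mask yields the intended 4160.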
      
      Using PAGE_MASK in the 3K RX buffer/4K page architecture configuration
      will always sync the first page of a compound page. With the SWIOTLB
      enabled this can lead to corrupted packets (zeroed out first fragment,
      re-used garbage from another packet) and various consequences, such as
      slow/stalling data transfers and connection resets. For example, testing
      on a link with MTU exceeding 3058 bytes on a host with SWIOTLB enabled
      (e.g. "iommu=soft swiotlb=262144,force") TCP transfers quickly fizzle
      out without this patch.
      
      Cc: stable@vger.kernel.org
      Fixes: 0c5661ec ("ixgbe: fix crash in build_skb Rx code path")
      Signed-off-by: Markus Boehme <markubo@amazon.com>
      Tested-by: Tony Brelinski <tonyx.brelinski@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      423123e4
    • media: ngene: Fix out-of-bounds bug in ngene_command_config_free_buf() · b9a178f1
      Gustavo A. R. Silva authored
      commit 8d4abca9 upstream.
      
      Fix an 11-year old bug in ngene_command_config_free_buf() while
      addressing the following warnings caught with -Warray-bounds:
      
      arch/alpha/include/asm/string.h:22:16: warning: '__builtin_memcpy' offset [12, 16] from the object at 'com' is out of the bounds of referenced subobject 'config' with type 'unsigned char' at offset 10 [-Warray-bounds]
      arch/x86/include/asm/string_32.h:182:25: warning: '__builtin_memcpy' offset [12, 16] from the object at 'com' is out of the bounds of referenced subobject 'config' with type 'unsigned char' at offset 10 [-Warray-bounds]
      
      The problem is that the original code is trying to copy 6 bytes of
      data into a one-byte size member _config_ of the wrong structure
      FW_CONFIGURE_BUFFERS, in a single call to memcpy(). This causes a
      legitimate compiler warning because memcpy() overruns the length
      of &com.cmd.ConfigureBuffers.config. It seems that the right
      structure is FW_CONFIGURE_FREE_BUFFERS, instead, because it contains
      6 more members apart from the header _hdr_. Also, the name of
      the function ngene_command_config_free_buf() suggests that the actual
      intention is to ConfigureFreeBuffers, instead of ConfigureBuffers
      (which takes place in the function ngene_command_config_buf(), above).
      
      Fix this by enclosing those 6 members of struct FW_CONFIGURE_FREE_BUFFERS
      into new struct config, and use &com.cmd.ConfigureFreeBuffers.config as
      the destination address, instead of &com.cmd.ConfigureBuffers.config,
      when calling memcpy().
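
      The pattern can be sketched with toy structures. The type and member
      names below (toy_hdr, toy_free_buffers_cmd, len0..len5) are
      hypothetical and do not reproduce the ngene definitions.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical command header. */
struct toy_hdr {
    uint8_t opcode;
    uint8_t length;
};

/* The six payload members are wrapped in one named sub-struct. */
struct toy_free_buffers_cmd {
    struct toy_hdr hdr;
    struct {
        uint8_t len0, len1, len2, len3, len4, len5;
    } config;
};

_Static_assert(sizeof(((struct toy_free_buffers_cmd *)0)->config) == 6,
               "config must cover exactly the six payload bytes");

/* The copy is bounded by the destination sub-object, not a single byte. */
static void toy_fill_config(struct toy_free_buffers_cmd *cmd,
                            const uint8_t src[6])
{
    memcpy(&cmd->config, src, sizeof(cmd->config));
}
```

      With the destination being a 6-byte sub-object, the compiler can see
      that the memcpy() stays within bounds.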
      
      This also helps with the ongoing efforts to globally enable
      -Warray-bounds and get us closer to being able to tighten the
      FORTIFY_SOURCE routines on memcpy().
      
      Link: https://github.com/KSPP/linux/issues/109
      Fixes: dae52d00 ("V4L/DVB: ngene: Initial check-in")
      Cc: stable@vger.kernel.org
      Reported-by: kernel test robot <lkp@intel.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Link: https://lore.kernel.org/linux-hardening/20210420001631.GA45456@embeddedor/
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b9a178f1
    • btrfs: fix lock inversion problem when doing qgroup extent tracing · f5ef2fe0
      Filipe Manana authored
      commit 8949b9a1 upstream.
      
      At btrfs_qgroup_trace_extent_post() we call btrfs_find_all_roots() with a
      NULL value as the transaction handle argument, which makes that function
      take the commit_root_sem semaphore, which is necessary when we don't hold
      a transaction handle or any other mechanism to prevent a transaction
      commit from wiping out commit roots.
      
      However btrfs_qgroup_trace_extent_post() can be called in a context where
      we are holding a write lock on an extent buffer from a subvolume tree,
      namely from btrfs_truncate_inode_items(), called either during truncate
      or unlink operations. In this case we end up with a lock inversion problem
      because the commit_root_sem is a higher level lock, always supposed to be
      acquired before locking any extent buffer.
      
      Lockdep detects this lock inversion problem since we switched the extent
      buffer locks from custom locks to semaphores, and when running btrfs/158
      from fstests, it reported the following trace:
      
      [ 9057.626435] ======================================================
      [ 9057.627541] WARNING: possible circular locking dependency detected
      [ 9057.628334] 5.14.0-rc2-btrfs-next-93 #1 Not tainted
      [ 9057.628961] ------------------------------------------------------
      [ 9057.629867] kworker/u16:4/30781 is trying to acquire lock:
      [ 9057.630824] ffff8e2590f58760 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.632542]
                     but task is already holding lock:
      [ 9057.633551] ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs]
      [ 9057.635255]
                     which lock already depends on the new lock.
      
      [ 9057.636292]
                     the existing dependency chain (in reverse order) is:
      [ 9057.637240]
                     -> #1 (&fs_info->commit_root_sem){++++}-{3:3}:
      [ 9057.638138]        down_read+0x46/0x140
      [ 9057.638648]        btrfs_find_all_roots+0x41/0x80 [btrfs]
      [ 9057.639398]        btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
      [ 9057.640283]        btrfs_add_delayed_data_ref+0x418/0x490 [btrfs]
      [ 9057.641114]        btrfs_free_extent+0x35/0xb0 [btrfs]
      [ 9057.641819]        btrfs_truncate_inode_items+0x424/0xf70 [btrfs]
      [ 9057.642643]        btrfs_evict_inode+0x454/0x4f0 [btrfs]
      [ 9057.643418]        evict+0xcf/0x1d0
      [ 9057.643895]        do_unlinkat+0x1e9/0x300
      [ 9057.644525]        do_syscall_64+0x3b/0xc0
      [ 9057.645110]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [ 9057.645835]
                     -> #0 (btrfs-tree-00){++++}-{3:3}:
      [ 9057.646600]        __lock_acquire+0x130e/0x2210
      [ 9057.647248]        lock_acquire+0xd7/0x310
      [ 9057.647773]        down_read_nested+0x4b/0x140
      [ 9057.648350]        __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.649175]        btrfs_read_lock_root_node+0x31/0x40 [btrfs]
      [ 9057.650010]        btrfs_search_slot+0x537/0xc00 [btrfs]
      [ 9057.650849]        scrub_print_warning_inode+0x89/0x370 [btrfs]
      [ 9057.651733]        iterate_extent_inodes+0x1e3/0x280 [btrfs]
      [ 9057.652501]        scrub_print_warning+0x15d/0x2f0 [btrfs]
      [ 9057.653264]        scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs]
      [ 9057.654295]        scrub_bio_end_io_worker+0x101/0x2e0 [btrfs]
      [ 9057.655111]        btrfs_work_helper+0xf8/0x400 [btrfs]
      [ 9057.655831]        process_one_work+0x247/0x5a0
      [ 9057.656425]        worker_thread+0x55/0x3c0
      [ 9057.656993]        kthread+0x155/0x180
      [ 9057.657494]        ret_from_fork+0x22/0x30
      [ 9057.658030]
                     other info that might help us debug this:
      
      [ 9057.659064]  Possible unsafe locking scenario:
      
      [ 9057.659824]        CPU0                    CPU1
      [ 9057.660402]        ----                    ----
      [ 9057.660988]   lock(&fs_info->commit_root_sem);
      [ 9057.661581]                                lock(btrfs-tree-00);
      [ 9057.662348]                                lock(&fs_info->commit_root_sem);
      [ 9057.663254]   lock(btrfs-tree-00);
      [ 9057.663690]
                      *** DEADLOCK ***
      
      [ 9057.664437] 4 locks held by kworker/u16:4/30781:
      [ 9057.665023]  #0: ffff8e25922a1148 ((wq_completion)btrfs-scrub){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0
      [ 9057.666260]  #1: ffffabb3451ffe70 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0
      [ 9057.667639]  #2: ffff8e25922da198 (&ret->mutex){+.+.}-{3:3}, at: scrub_handle_errored_block.isra.0+0x5d2/0x1640 [btrfs]
      [ 9057.669017]  #3: ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs]
      [ 9057.670408]
                     stack backtrace:
      [ 9057.670976] CPU: 7 PID: 30781 Comm: kworker/u16:4 Not tainted 5.14.0-rc2-btrfs-next-93 #1
      [ 9057.672030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [ 9057.673492] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
      [ 9057.674258] Call Trace:
      [ 9057.674588]  dump_stack_lvl+0x57/0x72
      [ 9057.675083]  check_noncircular+0xf3/0x110
      [ 9057.675611]  __lock_acquire+0x130e/0x2210
      [ 9057.676132]  lock_acquire+0xd7/0x310
      [ 9057.676605]  ? __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.677313]  ? lock_is_held_type+0xe8/0x140
      [ 9057.677849]  down_read_nested+0x4b/0x140
      [ 9057.678349]  ? __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.679068]  __btrfs_tree_read_lock+0x24/0x110 [btrfs]
      [ 9057.679760]  btrfs_read_lock_root_node+0x31/0x40 [btrfs]
      [ 9057.680458]  btrfs_search_slot+0x537/0xc00 [btrfs]
      [ 9057.681083]  ? _raw_spin_unlock+0x29/0x40
      [ 9057.681594]  ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs]
      [ 9057.682336]  scrub_print_warning_inode+0x89/0x370 [btrfs]
      [ 9057.683058]  ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs]
      [ 9057.683834]  ? scrub_write_block_to_dev_replace+0xb0/0xb0 [btrfs]
      [ 9057.684632]  iterate_extent_inodes+0x1e3/0x280 [btrfs]
      [ 9057.685316]  scrub_print_warning+0x15d/0x2f0 [btrfs]
      [ 9057.685977]  ? ___ratelimit+0xa4/0x110
      [ 9057.686460]  scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs]
      [ 9057.687316]  scrub_bio_end_io_worker+0x101/0x2e0 [btrfs]
      [ 9057.688021]  btrfs_work_helper+0xf8/0x400 [btrfs]
      [ 9057.688649]  ? lock_is_held_type+0xe8/0x140
      [ 9057.689180]  process_one_work+0x247/0x5a0
      [ 9057.689696]  worker_thread+0x55/0x3c0
      [ 9057.690175]  ? process_one_work+0x5a0/0x5a0
      [ 9057.690731]  kthread+0x155/0x180
      [ 9057.691158]  ? set_kthread_struct+0x40/0x40
      [ 9057.691697]  ret_from_fork+0x22/0x30
      
      Fix this by making btrfs_find_all_roots() never attempt to lock the
      commit_root_sem when it is called from btrfs_qgroup_trace_extent_post().
      
      We can't just pass a non-NULL transaction handle to btrfs_find_all_roots()
      from btrfs_qgroup_trace_extent_post(), because that would make backref
      lookup not use commit roots and acquire read locks on extent buffers, and
      therefore could deadlock when btrfs_qgroup_trace_extent_post() is called
      from the btrfs_truncate_inode_items() code path which has acquired a write
      lock on an extent buffer of the subvolume btree.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f5ef2fe0
    • btrfs: fix unpersisted i_size on fsync after expanding truncate · 6f919907
      Filipe Manana authored
      commit 9acc8103 upstream.
      
      If we have an inode that does not have the full sync flag set, was changed
      in the current transaction, then it is logged while logging some other
      inode (like its parent directory for example), its i_size is increased by
      a truncate operation, the log is synced through an fsync of some other
      inode and then finally we explicitly call fsync on our inode, the new
      i_size is not persisted.
      
      The following example shows how to trigger it, with comments explaining
      how and why the issue happens:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt
      
        $ touch /mnt/foo
        $ xfs_io -f -c "pwrite -S 0xab 0 1M" /mnt/bar
      
        $ sync
      
        # Fsync bar, this will be a noop since the file has not yet been
        # modified in the current transaction. The goal here is to clear
        # BTRFS_INODE_NEEDS_FULL_SYNC from the inode's runtime flags.
        $ xfs_io -c "fsync" /mnt/bar
      
        # Now rename both files, without changing their parent directory.
        $ mv /mnt/bar /mnt/bar2
        $ mv /mnt/foo /mnt/foo2
      
        # Increase the size of bar2 with a truncate operation.
        $ xfs_io -c "truncate 2M" /mnt/bar2
      
        # Now fsync foo2, this results in logging its parent inode (the root
        # directory), and logging the parent results in logging the inode of
        # file bar2 (its inode item and the new name). The inode of file bar2
        # is logged with an i_size of 0 bytes since it's logged in
        # LOG_INODE_EXISTS mode, meaning we are only logging its names (and
        # xattrs if it had any) and the i_size of the inode will not be changed
        # when the log is replayed.
        $ xfs_io -c "fsync" /mnt/foo2
      
        # Now explicitly fsync bar2. This resulted in doing nothing, not
        # logging the inode with the new i_size of 2M and the hole from file
        # offset 1M to 2M. Because the inode did not have the flag
        # BTRFS_INODE_NEEDS_FULL_SYNC set, when it was logged through the
        # fsync of file foo2, its last_log_commit field was updated,
        # resulting in this explicit fsync of file bar2 not doing anything.
        $ xfs_io -c "fsync" /mnt/bar2
      
        # File bar2 content and size before a power failure.
        $ od -A d -t x1 /mnt/bar2
        0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
        *
        1048576 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        *
        2097152
      
        <power failure>
      
        # Mount the filesystem to replay the log.
        $ mount /dev/sdc /mnt
      
        # Read the file again, should have the same content and size as before
        # the power failure happened, but it doesn't, i_size is still at 1M.
        $ od -A d -t x1 /mnt/bar2
        0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
        *
        1048576
      
      This started to happen after commit 209ecbb8 ("btrfs: remove stale
      comment and logic from btrfs_inode_in_log()"), since btrfs_inode_in_log()
      no longer checks if the inode's list of modified extents is not empty.
      However, checking that list is not the right way to address this case
      and the check was added a long time ago in commit 125c4cf9
      ("Btrfs: set inode's logged_trans/last_log_commit after ranged fsync")
      for a different purpose, to address consecutive ranged fsyncs.
      
      The reason that checking for the list emptiness makes this test pass is
      because during an expanding truncate we create an extent map to represent
      a hole from the old i_size to the new i_size, and add that extent map to
      the list of modified extents in the inode. However if we are low on
      available memory and we can not allocate a new extent map, then we don't
      treat it as an error and just set the full sync flag on the inode, so that
      the next fsync does not rely on the list of modified extents - so checking
      for the emptiness of the list to decide if the inode needs to be logged is
      not reliable, and results in not logging the inode if it was not possible
      to allocate the extent map for the hole.
      
      Fix this by ensuring that if we are only logging that an inode exists
      (inode item, names/references and xattrs), we don't update the inode's
      last_log_commit even if it does not have the full sync runtime flag set.
      
      A test case for fstests follows soon.
      
      CC: stable@vger.kernel.org # 5.13+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6f919907
    • btrfs: check for missing device in btrfs_trim_fs · a02b5448
      Anand Jain authored
      commit 16a200f6 upstream.
      
      A fstrim on a degraded raid1 can trigger the following null pointer
      dereference:
      
        BTRFS info (device loop0): allowing degraded mounts
        BTRFS info (device loop0): disk space caching is enabled
        BTRFS info (device loop0): has skinny extents
        BTRFS warning (device loop0): devid 2 uuid 97ac16f7-e14d-4db1-95bc-3d489b424adb is missing
        BTRFS warning (device loop0): devid 2 uuid 97ac16f7-e14d-4db1-95bc-3d489b424adb is missing
        BTRFS info (device loop0): enabling ssd optimizations
        BUG: kernel NULL pointer dereference, address: 0000000000000620
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP NOPTI
        CPU: 0 PID: 4574 Comm: fstrim Not tainted 5.13.0-rc7+ #31
        Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
        RIP: 0010:btrfs_trim_fs+0x199/0x4a0 [btrfs]
        RSP: 0018:ffff959541797d28 EFLAGS: 00010293
        RAX: 0000000000000000 RBX: ffff946f84eca508 RCX: a7a67937adff8608
        RDX: ffff946e8122d000 RSI: 0000000000000000 RDI: ffffffffc02fdbf0
        RBP: ffff946ea4615000 R08: 0000000000000001 R09: 0000000000000000
        R10: 0000000000000000 R11: ffff946e8122d960 R12: 0000000000000000
        R13: ffff959541797db8 R14: ffff946e8122d000 R15: ffff959541797db8
        FS:  00007f55917a5080(0000) GS:ffff946f9bc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000620 CR3: 000000002d2c8001 CR4: 00000000000706f0
        Call Trace:
        btrfs_ioctl_fitrim+0x167/0x260 [btrfs]
        btrfs_ioctl+0x1c00/0x2fe0 [btrfs]
        ? selinux_file_ioctl+0x140/0x240
        ? syscall_trace_enter.constprop.0+0x188/0x240
        ? __x64_sys_ioctl+0x83/0xb0
        __x64_sys_ioctl+0x83/0xb0
      
      Reproducer:
      
        $ mkfs.btrfs -fq -d raid1 -m raid1 /dev/loop0 /dev/loop1
        $ mount /dev/loop0 /btrfs
        $ umount /btrfs
        $ btrfs dev scan --forget
        $ mount -o degraded /dev/loop0 /btrfs
      
        $ fstrim /btrfs
      
      The reason is we call btrfs_trim_free_extents() for the missing device,
      which uses device->bdev (NULL for missing device) to find if the device
      supports discard.
      
      The fix is to check whether the device is missing before calling
      btrfs_trim_free_extents().
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tracing: Synthetic event field_pos is an index not a boolean · 020d8cea
      Steven Rostedt (VMware) authored
      commit 3b13911a upstream.
      
      Performing the following:
      
       ># echo 'wakeup_lat s32 pid; u64 delta; char wake_comm[]' > synthetic_events
       ># echo 'hist:keys=pid:__arg__1=common_timestamp.usecs' > events/sched/sched_waking/trigger
       ># echo 'hist:keys=next_pid:pid=next_pid,delta=common_timestamp.usecs-$__arg__1:onmatch(sched.sched_waking).trace(wakeup_lat,$pid,$delta,prev_comm)'\
            > events/sched/sched_switch/trigger
       ># echo 1 > events/synthetic/enable
      
      Crashed the kernel:
      
       BUG: kernel NULL pointer dereference, address: 000000000000001b
       #PF: supervisor read access in kernel mode
       #PF: error_code(0x0000) - not-present page
       PGD 0 P4D 0
       Oops: 0000 [#1] PREEMPT SMP
       CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.13.0-rc5-test+ #104
       Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
       RIP: 0010:strlen+0x0/0x20
       Code: f6 82 80 2b 0b bc 20 74 11 0f b6 50 01 48 83 c0 01 f6 82 80 2b 0b bc
        20 75 ef c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 <80> 3f 00 74 10
        48 89 f8 48 83 c0 01 80 38 9 f8 c3 31
       RSP: 0018:ffffaa75000d79d0 EFLAGS: 00010046
       RAX: 0000000000000002 RBX: ffff9cdb55575270 RCX: 0000000000000000
       RDX: ffff9cdb58c7a320 RSI: ffffaa75000d7b40 RDI: 000000000000001b
       RBP: ffffaa75000d7b40 R08: ffff9cdb40a4f010 R09: ffffaa75000d7ab8
       R10: ffff9cdb4398c700 R11: 0000000000000008 R12: ffff9cdb58c7a320
       R13: ffff9cdb55575270 R14: ffff9cdb58c7a000 R15: 0000000000000018
       FS:  0000000000000000(0000) GS:ffff9cdb5aa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 000000000000001b CR3: 00000000c0612006 CR4: 00000000001706e0
       Call Trace:
        trace_event_raw_event_synth+0x90/0x1d0
        action_trace+0x5b/0x70
        event_hist_trigger+0x4bd/0x4e0
        ? cpumask_next_and+0x20/0x30
        ? update_sd_lb_stats.constprop.0+0xf6/0x840
        ? __lock_acquire.constprop.0+0x125/0x550
        ? find_held_lock+0x32/0x90
        ? sched_clock_cpu+0xe/0xd0
        ? lock_release+0x155/0x440
        ? update_load_avg+0x8c/0x6f0
        ? enqueue_entity+0x18a/0x920
        ? __rb_reserve_next+0xe5/0x460
        ? ring_buffer_lock_reserve+0x12a/0x3f0
        event_triggers_call+0x52/0xe0
        trace_event_buffer_commit+0x1ae/0x240
        trace_event_raw_event_sched_switch+0x114/0x170
        __traceiter_sched_switch+0x39/0x50
        __schedule+0x431/0xb00
        schedule_idle+0x28/0x40
        do_idle+0x198/0x2e0
        cpu_startup_entry+0x19/0x20
        secondary_startup_64_no_verify+0xc2/0xcb
      
      The reason is that the dynamic events array keeps track of the field
      position of the fields array, via the field_pos variable in the
      synth_field structure. Unfortunately, that field is a boolean for some
      reason, which means any field_pos greater than 1 will be a bug (in this
      case it was 2).
      
      Link: https://lkml.kernel.org/r/20210721191008.638bce34@oasis.local.home
      
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Fixes: bd82631d ("tracing: Add support for dynamic strings to synthetic events")
      Reviewed-by: Tom Zanussi <zanussi@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop. · 917a5bdd
      Haoran Luo authored
      commit 67f0d6d9 upstream.
      
      "rb_per_cpu_empty()" misinterprets the condition (as not-empty) when
      "head_page" and "commit_page" of "struct ring_buffer_per_cpu" point to
      the same buffer page, whose "buffer_data_page" is empty and whose
      "read" field is non-zero.
      
      An error scenario can be constructed as follows (kernel perspective):

      1. All pages in the buffer have been accessed by reader(s), so that all
      of them have a non-zero "read" field.

      2. Read and clear all buffer pages so that "rb_num_of_entries()"
      returns 0, indicating there is no more data to read. It is also
      required that "read_page", "commit_page" and "tail_page" point to the
      same page, while "head_page" is the page that follows them.

      3. Invoke "ring_buffer_lock_reserve()" with a "length" large enough
      that it shoots past the end of the current tail buffer page. Now
      "head_page", "commit_page" and "tail_page" point to the same page.

      4. Discard the current event with "ring_buffer_discard_commit()", so
      that "head_page", "commit_page" and "tail_page" point to a page whose
      buffer data page is now empty.
      
      Once the error scenario has been constructed, "tracing_read_pipe" is
      trapped inside an endless loop: "trace_empty()" returns 0 because
      "rb_per_cpu_empty()" returns 0 when it hits the CPU containing the
      ring buffer constructed above. Then "trace_find_next_entry_inc()"
      always returns NULL because "rb_num_of_entries()" reports that there
      are no more entries to read. Finally "trace_seq_to_user()" returns
      "-EBUSY", sending "tracing_read_pipe" back to the start of the
      "waitagain" loop.
      
      I've also written a proof-of-concept script that constructs the
      scenario and triggers the bug automatically. You can use it to trace
      and validate my reasoning above:
      
        https://github.com/aegistudio/RingBufferDetonator.git
      
      Tests have been carried out on Linux kernel 5.14-rc2 (2734d6c1), on my
      fixed version of the kernel (to check whether my update fixes the bug)
      and on some older kernels (to determine the range of affected
      kernels). The test results are also attached to the proof-of-concept
      repository.
      
      Link: https://lore.kernel.org/linux-trace-devel/YPaNxsIlb2yjSi5Y@aegistudio/
      Link: https://lore.kernel.org/linux-trace-devel/YPgrN85WL9VyrZ55@aegistudio
      
      Cc: stable@vger.kernel.org
      Fixes: bf41a158 ("ring-buffer: make reentrant")
      Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
      Signed-off-by: Haoran Luo <www@aegistudio.net>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tracing/histogram: Rename "cpu" to "common_cpu" · 29ecaddb
      Steven Rostedt (VMware) authored
      commit 1e3bac71 upstream.
      
      Currently the histogram logic allows the user to write "cpu" in as an
      event field, and it will record the CPU that the event happened on.
      
      The problem with this is that a lot of events have "cpu" as a real
      field, and using "cpu" to mean the CPU the event ran on makes it
      impossible to run histograms on the "cpu" field of those events.
      
      For example, if I want to have a histogram on the count of the
      workqueue_queue_work event on its cpu field, running:
      
       ># echo 'hist:keys=cpu' > events/workqueue/workqueue_queue_work/trigger
      
      Gives a misleading and wrong result.
      
      Change the command to "common_cpu" as no event should have "common_*"
      fields as that's a reserved name for fields used by all events. And
      this makes sense here as common_cpu would be a field used by all events.
      
      Now we can even do:
      
       ># echo 'hist:keys=common_cpu,cpu if cpu < 100' > events/workqueue/workqueue_queue_work/trigger
       ># cat events/workqueue/workqueue_queue_work/hist
       # event histogram
       #
       # trigger info: hist:keys=common_cpu,cpu:vals=hitcount:sort=hitcount:size=2048 if cpu < 100 [active]
       #
      
       { common_cpu:          0, cpu:          2 } hitcount:          1
       { common_cpu:          0, cpu:          4 } hitcount:          1
       { common_cpu:          7, cpu:          7 } hitcount:          1
       { common_cpu:          0, cpu:          7 } hitcount:          1
       { common_cpu:          0, cpu:          1 } hitcount:          1
       { common_cpu:          0, cpu:          6 } hitcount:          2
       { common_cpu:          0, cpu:          5 } hitcount:          2
       { common_cpu:          1, cpu:          1 } hitcount:          4
       { common_cpu:          6, cpu:          6 } hitcount:          4
       { common_cpu:          5, cpu:          5 } hitcount:         14
       { common_cpu:          4, cpu:          4 } hitcount:         26
       { common_cpu:          0, cpu:          0 } hitcount:         39
       { common_cpu:          2, cpu:          2 } hitcount:        184
      
      Now for backward compatibility, I added a trick. If "cpu" is used, and
      the field is not found, it will fall back to "common_cpu" and work as
      it did before. This way, it will still work for old programs that use
      "cpu" to get the actual CPU, but if the event has a "cpu" as a field, it
      will get that event's "cpu" field, which is probably what it wants
      anyway.
      
      I updated the tracefs/README to include documentation about both the
      common_timestamp and the common_cpu. This way, if that text is present in
      the README, then an application can know that common_cpu is supported over
      just plain "cpu".
      
      Link: https://lkml.kernel.org/r/20210721110053.26b4f641@oasis.local.home
      
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@vger.kernel.org
      Fixes: 8b7622bf ("tracing: Add cpu field for hist triggers")
      Reviewed-by: Tom Zanussi <zanussi@kernel.org>
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • tracepoints: Update static_call before tp_funcs when adding a tracepoint · 58f47cfe
      Steven Rostedt (VMware) authored
      commit 352384d5 upstream.
      
      Because of the significant overhead that retpolines pose on indirect
      calls, the tracepoint code was updated to use the new "static_calls" that
      can modify the running code to directly call a function instead of using
      an indirect caller, and this function can be changed at runtime.
      
      In the tracepoint code that calls all the registered callbacks that are
      attached to a tracepoint, the following is done:
      
      	it_func_ptr = rcu_dereference_raw((&__tracepoint_##name)->funcs);
      	if (it_func_ptr) {
      		__data = (it_func_ptr)->data;
      		static_call(tp_func_##name)(__data, args);
      	}
      
      If there's just a single callback, the static_call is updated to just call
      that callback directly. Once another handler is added, then the static
      caller is updated to call the iterator, that simply loops over all the
      funcs in the array and calls each of the callbacks like the old method
      using indirect calling.
      
      The issue was discovered with a race between updating the funcs array and
      updating the static_call. The funcs array was updated first and then the
      static_call was updated. This is not an issue as long as the first element
      in the old array is the same as the first element in the new array. But
      that assumption is incorrect, because callbacks also have a priority
      field, and if there's a callback added that has a higher priority than the
      callback on the old array, then it will become the first callback in the
      new array. This means that it is possible to call the old callback with
      the new callback data element, which can cause a kernel panic.
      
      	static_call = callback1()
      	funcs[] = {callback1,data1};
      	callback2 has higher priority than callback1
      
      	CPU 1				CPU 2
      	-----				-----
      
         new_funcs = {callback2,data2},
                     {callback1,data1}
      
         rcu_assign_pointer(tp->funcs, new_funcs);
      
        /*
         * Now tp->funcs has the new array
         * but the static_call still calls callback1
         */
      
      				it_func_ptr = tp->funcs [ new_funcs ]
      				data = it_func_ptr->data [ data2 ]
      				static_call(callback1, data);
      
      				/* Now callback1 is called with
      				 * callback2's data */
      
      				[ KERNEL PANIC ]
      
         update_static_call(iterator);
      
      To prevent this from happening, always switch the static_call to the
      iterator before assigning the tp->funcs to the new array. The iterator will
      always properly match the callback with its data.
      
      To trigger this bug:
      
        In one terminal:
      
          while :; do hackbench 50; done
      
        In another terminal:

          echo 1 > /sys/kernel/tracing/events/sched/sched_waking/enable
          while :; do
              echo 1 > /sys/kernel/tracing/set_event_pid;
              sleep 0.5
              echo 0 > /sys/kernel/tracing/set_event_pid;
              sleep 0.5
          done
      
      And it doesn't take long to crash. This is because the set_event_pid adds
      a callback to the sched_waking tracepoint with a high priority, which will
      be called before the sched_waking trace event callback is called.
      
      Note that removing callbacks down to a single one updates the array
      first and only then changes the static_call to that single callback.
      That is the proper order, because the first element in the array is
      the same as what the static_call is being changed to.
      
      Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/
      
      Cc: stable@vger.kernel.org
      Fixes: d25e37d8 ("tracepoint: Optimize using static_call()")
      Reported-by: Stefan Metzmacher <metze@samba.org>
      Tested-by: Stefan Metzmacher <metze@samba.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>