Jul 07, 2021
Jun 30, 2021
      integrity: Load mokx variables into the blacklist keyring · 1573d595
      Eric Snowberg authored
[ Upstream commit ebd9c2ae ]
      
      During boot the Secure Boot Forbidden Signature Database, dbx,
      is loaded into the blacklist keyring.  Systems booted with shim
      have an equivalent Forbidden Signature Database called mokx.
Currently mokx is only used by shim and grub; its contents are
ignored by the kernel.
      
      Add the ability to load mokx into the blacklist keyring during boot.
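
As a rough illustration only (helper names such as get_cert_list() and
get_handler_for_dbx() are borrowed from the existing dbx loading code in
security/integrity/platform_certs/ and are assumptions here, not the
exact patch), such a loader could look like:

    static int __init load_moklist_certs(void)
    {
            efi_guid_t shim_guid = EFI_SHIM_LOCK_GUID;
            efi_status_t status;
            unsigned long moksize;
            void *mokx;
            int rc;

            /* Fetch the MokListXRT variable published by shim. */
            mokx = get_cert_list(L"MokListXRT", &shim_guid, &moksize, &status);
            if (!mokx)
                    return (status == EFI_NOT_FOUND) ? 0 : -ENOMEM;

            /* Hand every entry to the same handlers already used for dbx. */
            rc = parse_efi_signature_list("UEFI:MokListXRT",
                                          mokx, moksize, get_handler_for_dbx);
            kfree(mokx);
            return rc;
    }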
      
Signed-off-by: Eric Snowberg <eric.snowberg@oracle.com>
Suggested-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
      cc: keyrings@vger.kernel.org
      Link: https://lore.kernel.org/r/c33c8e3839a41e9654f41cc92c7231104931b1d7.camel@HansenPartnership.com/
      Link: https://lore.kernel.org/r/20210122181054.32635-5-eric.snowberg@oracle.com/ # v5
      Link: https://lore.kernel.org/r/161428674320.677100.12637282414018170743.stgit@warthog.procyon.org.uk/
      Link: https://lore.kernel.org/r/161433313205.902181.2502803393898221637.stgit@warthog.procyon.org.uk/ # v2
      Link: https://lore.kernel.org/r/161529607422.163428.13530426573612578854.stgit@warthog.procyon.org.uk/ # v3
Signed-off-by: Sasha Levin <sashal@kernel.org>
      1573d595
      certs: Add ability to preload revocation certs · c6ae6f89
      Eric Snowberg authored
[ Upstream commit d1f04410 ]
      
      Add a new Kconfig option called SYSTEM_REVOCATION_KEYS. If set,
this option should be the filename of a PEM-formatted file containing
      X.509 certificates to be included in the default blacklist keyring.
      
      DH Changes:
       - Make the new Kconfig option depend on SYSTEM_REVOCATION_LIST.
       - Fix SYSTEM_REVOCATION_KEYS=n, but CONFIG_SYSTEM_REVOCATION_LIST=y[1][2].
       - Use CONFIG_SYSTEM_REVOCATION_LIST for extract-cert[3].
       - Use CONFIG_SYSTEM_REVOCATION_LIST for revocation_certificates.o[3].
      
Signed-off-by: Eric Snowberg <eric.snowberg@oracle.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Randy Dunlap <rdunlap@infradead.org>
      cc: keyrings@vger.kernel.org
      Link: https://lore.kernel.org/r/e1c15c74-82ce-3a69-44de-a33af9b320ea@infradead.org/ [1]
      Link: https://lore.kernel.org/r/20210303034418.106762-1-eric.snowberg@oracle.com/ [2]
      Link: https://lore.kernel.org/r/20210304175030.184131-1-eric.snowberg@oracle.com/ [3]
      Link: https://lore.kernel.org/r/20200930201508.35113-3-eric.snowberg@oracle.com/
      Link: https://lore.kernel.org/r/20210122181054.32635-4-eric.snowberg@oracle.com/ # v5
      Link: https://lore.kernel.org/r/161428673564.677100.4112098280028451629.stgit@warthog.procyon.org.uk/
      Link: https://lore.kernel.org/r/161433312452.902181.4146169951896577982.stgit@warthog.procyon.org.uk/ # v2
      Link: https://lore.kernel.org/r/161529606657.163428.3340689182456495390.stgit@warthog.procyon.org.uk/ # v3
Signed-off-by: Sasha Levin <sashal@kernel.org>
      c6ae6f89
      certs: Move load_system_certificate_list to a common function · 72d6f5d9
      Eric Snowberg authored
[ Upstream commit 2565ca7f ]
      
      Move functionality within load_system_certificate_list to a common
      function, so it can be reused in the future.
      
      DH Changes:
       - Added inclusion of common.h to common.c (Eric [1]).
      
Signed-off-by: Eric Snowberg <eric.snowberg@oracle.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
      cc: keyrings@vger.kernel.org
      Link: https://lore.kernel.org/r/EDA280F9-F72D-4181-93C7-CDBE95976FF7@oracle.com/ [1]
      Link: https://lore.kernel.org/r/20200930201508.35113-2-eric.snowberg@oracle.com/
      Link: https://lore.kernel.org/r/20210122181054.32635-3-eric.snowberg@oracle.com/ # v5
      Link: https://lore.kernel.org/r/161428672825.677100.7545516389752262918.stgit@warthog.procyon.org.uk/
      Link: https://lore.kernel.org/r/161433311696.902181.3599366124784670368.stgit@warthog.procyon.org.uk/ # v2
      Link: https://lore.kernel.org/r/161529605850.163428.7786675680201528556.stgit@warthog.procyon.org.uk/ # v3
Signed-off-by: Sasha Levin <sashal@kernel.org>
      72d6f5d9
      certs: Add EFI_CERT_X509_GUID support for dbx entries · 45109066
      Eric Snowberg authored
[ Upstream commit 56c58126 ]
      
      This fixes CVE-2020-26541.
      
      The Secure Boot Forbidden Signature Database, dbx, contains a list of now
      revoked signatures and keys previously approved to boot with UEFI Secure
      Boot enabled.  The dbx is capable of containing any number of
      EFI_CERT_X509_SHA256_GUID, EFI_CERT_SHA256_GUID, and EFI_CERT_X509_GUID
      entries.
      
Currently, when EFI_CERT_X509_GUID entries are contained in the dbx,
they are skipped.
      
Add support for EFI_CERT_X509_GUID dbx entries. When an EFI_CERT_X509_GUID
is found, it is added as an asymmetric key to the .blacklist keyring.
Any time the .platform keyring is used, the keys in the .blacklist keyring
are referenced; if a matching key is found, the key will be rejected.
      
      [DH: Made the following changes:
- Added a config option to enable the facility.  This allows a
         Kconfig solution to make sure that pkcs7_validate_trust() is
         enabled.[1][2]
       - Moved the functions out from the middle of the blacklist functions.
       - Added kerneldoc comments.]
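
A hedged sketch of what inserting one such certificate into the
.blacklist keyring might look like (the helper name and the
permission/allocation flags are assumptions, not the exact patch):

    void add_key_to_revocation_list(const char *data, size_t size)
    {
            key_ref_t key;

            /* Insert the raw X.509 blob as an "asymmetric" key. */
            key = key_create_or_update(make_key_ref(blacklist_keyring, true),
                                       "asymmetric", NULL, data, size,
                                       KEY_POS_VIEW | KEY_POS_READ,
                                       KEY_ALLOC_NOT_IN_QUOTA |
                                       KEY_ALLOC_BYPASS_RESTRICTION);
            if (IS_ERR(key))
                    pr_err("Problem blacklisting X.509 key (%ld)\n", PTR_ERR(key));
    }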
      
Signed-off-by: Eric Snowberg <eric.snowberg@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
      cc: Randy Dunlap <rdunlap@infradead.org>
      cc: Mickaël Salaün <mic@digikod.net>
      cc: Arnd Bergmann <arnd@kernel.org>
      cc: keyrings@vger.kernel.org
      Link: https://lore.kernel.org/r/20200901165143.10295-1-eric.snowberg@oracle.com/ # rfc
      Link: https://lore.kernel.org/r/20200909172736.73003-1-eric.snowberg@oracle.com/ # v2
      Link: https://lore.kernel.org/r/20200911182230.62266-1-eric.snowberg@oracle.com/ # v3
      Link: https://lore.kernel.org/r/20200916004927.64276-1-eric.snowberg@oracle.com/ # v4
      Link: https://lore.kernel.org/r/20210122181054.32635-2-eric.snowberg@oracle.com/ # v5
      Link: https://lore.kernel.org/r/161428672051.677100.11064981943343605138.stgit@warthog.procyon.org.uk/
      Link: https://lore.kernel.org/r/161433310942.902181.4901864302675874242.stgit@warthog.procyon.org.uk/ # v2
      Link: https://lore.kernel.org/r/161529605075.163428.14625520893961300757.stgit@warthog.procyon.org.uk/ # v3
      Link: https://lore.kernel.org/r/bc2c24e3-ed68-2521-0bf4-a1f6be4a895d@infradead.org/ [1]
      Link: https://lore.kernel.org/r/20210225125638.1841436-1-arnd@kernel.org/ [2]
Signed-off-by: Sasha Levin <sashal@kernel.org>
      45109066
      Revert "drm: add a locked version of drm_is_current_master" · 0ba128fa
      Daniel Vetter authored
      commit f54b3ca7 upstream.
      
      This reverts commit 1815d9c8.
      
      Unfortunately this inverts the locking hierarchy, so back to the
      drawing board. Full lockdep splat below:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.13.0-rc7-CI-CI_DRM_10254+ #1 Not tainted
      ------------------------------------------------------
      kms_frontbuffer/1087 is trying to acquire lock:
      ffff88810dcd01a8 (&dev->master_mutex){+.+.}-{3:3}, at: drm_is_current_master+0x1b/0x40
      but task is already holding lock:
      ffff88810dcd0488 (&dev->mode_config.mutex){+.+.}-{3:3}, at: drm_mode_getconnector+0x1c6/0x4a0
      which lock already depends on the new lock.
      the existing dependency chain (in reverse order) is:
      -> #2 (&dev->mode_config.mutex){+.+.}-{3:3}:
             __mutex_lock+0xab/0x970
             drm_client_modeset_probe+0x22e/0xca0
             __drm_fb_helper_initial_config_and_unlock+0x42/0x540
             intel_fbdev_initial_config+0xf/0x20 [i915]
             async_run_entry_fn+0x28/0x130
             process_one_work+0x26d/0x5c0
             worker_thread+0x37/0x380
             kthread+0x144/0x170
             ret_from_fork+0x1f/0x30
      -> #1 (&client->modeset_mutex){+.+.}-{3:3}:
             __mutex_lock+0xab/0x970
             drm_client_modeset_commit_locked+0x1c/0x180
             drm_client_modeset_commit+0x1c/0x40
             __drm_fb_helper_restore_fbdev_mode_unlocked+0x88/0xb0
             drm_fb_helper_set_par+0x34/0x40
             intel_fbdev_set_par+0x11/0x40 [i915]
             fbcon_init+0x270/0x4f0
             visual_init+0xc6/0x130
             do_bind_con_driver+0x1e5/0x2d0
             do_take_over_console+0x10e/0x180
             do_fbcon_takeover+0x53/0xb0
             register_framebuffer+0x22d/0x310
             __drm_fb_helper_initial_config_and_unlock+0x36c/0x540
             intel_fbdev_initial_config+0xf/0x20 [i915]
             async_run_entry_fn+0x28/0x130
             process_one_work+0x26d/0x5c0
             worker_thread+0x37/0x380
             kthread+0x144/0x170
             ret_from_fork+0x1f/0x30
      -> #0 (&dev->master_mutex){+.+.}-{3:3}:
             __lock_acquire+0x151e/0x2590
             lock_acquire+0xd1/0x3d0
             __mutex_lock+0xab/0x970
             drm_is_current_master+0x1b/0x40
             drm_mode_getconnector+0x37e/0x4a0
             drm_ioctl_kernel+0xa8/0xf0
             drm_ioctl+0x1e8/0x390
             __x64_sys_ioctl+0x6a/0xa0
             do_syscall_64+0x39/0xb0
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      other info that might help us debug this:
      Chain exists of: &dev->master_mutex --> &client->modeset_mutex --> &dev->mode_config.mutex
       Possible unsafe locking scenario:
             CPU0                    CPU1
             ----                    ----
        lock(&dev->mode_config.mutex);
                                     lock(&client->modeset_mutex);
                                     lock(&dev->mode_config.mutex);
        lock(&dev->master_mutex);
      0ba128fa
      netfs: fix test for whether we can skip read when writing beyond EOF · 0463b49e
      Jeff Layton authored
      commit 827a746f upstream.
      
      It's not sufficient to skip reading when the pos is beyond the EOF.
      There may be data at the head of the page that we need to fill in
      before the write.
      
      Add a new helper function that corrects and clarifies the logic of
      when we can skip reads, and have it only zero out the part of the page
      that won't have data copied in for the write.
      
      Finally, don't set the page Uptodate after zeroing. It's not up to date
      since the write data won't have been copied in yet.
      
      [DH made the following changes:
      
       - Prefixed the new function with "netfs_".
      
       - Don't call zero_user_segments() for a full-page write.
      
       - Altered the beyond-last-page check to avoid a DIV instruction and got
         rid of then-redundant zero-length file check.
      ]
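
A hedged sketch of such a helper (the name and the exact conditions are
approximations, not necessarily the final code):

    static bool netfs_skip_page_read(struct page *page, loff_t pos, size_t len)
    {
            struct inode *inode = page->mapping->host;
            loff_t i_size = i_size_read(inode);
            size_t offset = offset_in_page(pos);

            /* Full-page write: nothing to read or zero. */
            if (offset == 0 && len >= PAGE_SIZE)
                    return true;

            /* Page lies entirely beyond the data on disk. */
            if (pos - offset >= i_size)
                    goto zero_out;

            /* Write starts at the page head and extends to or past EOF. */
            if (offset == 0 && pos + len >= i_size)
                    goto zero_out;

            return false;
    zero_out:
            /* Zero only what the write will not overwrite; don't set Uptodate. */
            zero_user_segments(page, 0, offset, offset + len, PAGE_SIZE);
            return true;
    }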
      
      [ Note: this fix is commit 827a746f in mainline kernels. The
      	original bug was in ceph, but got lifted into the fs/netfs
      	library for v5.13. This backport should apply to stable
      	kernels v5.10 though v5.12. ]
      
Fixes: e1b1240c ("netfs: Add write_begin helper")
Reported-by: Andrew W Elble <aweits@rit.edu>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      cc: ceph-devel@vger.kernel.org
      Link: https://lore.kernel.org/r/20210613233345.113565-1-jlayton@kernel.org/
      Link: https://lore.kernel.org/r/162367683365.460125.4467036947364047314.stgit@warthog.procyon.org.uk/ # v1
      Link: https://lore.kernel.org/r/162391826758.1173366.11794946719301590013.stgit@warthog.procyon.org.uk/ # v2
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0463b49e
      swiotlb: manipulate orig_addr when tlb_addr has offset · e6108147
      Bumyong Lee authored
      commit 5f89468e upstream.
      
If a driver wants to sync part of a range with an offset,
swiotlb_tbl_sync_single() copies from the orig_addr base to the tlb_addr
with offset, and ends up with a data mismatch.

The offset handling was removed by
"swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single",
but said logic has to be added back in.
      
From Linus's email:
"That commit removed the offset calculation entirely, because the old
      
              (unsigned long)tlb_addr & (IO_TLB_SIZE - 1)
      
      was wrong, but instead of removing it, I think it should have just
      fixed it to be
      
              (tlb_addr - mem->start) & (IO_TLB_SIZE - 1);
      
      instead. That way the slot offset always matches the slot index calculation."
      
(Unfortunately that broke NVMe).
      
The use case that drivers are hitting is as follows:
      
      1. Get dma_addr_t from dma_map_single()
      
      dma_addr_t tlb_addr = dma_map_single(dev, vaddr, vsize, DMA_TO_DEVICE);
      
          |<---------------vsize------------->|
          +-----------------------------------+
          |                                   | original buffer
          +-----------------------------------+
        vaddr
      
       swiotlb_align_offset
           |<----->|<---------------vsize------------->|
           +-------+-----------------------------------+
           |       |                                   | swiotlb buffer
           +-------+-----------------------------------+
                tlb_addr
      
      2. Do something
      3. Sync dma_addr_t through dma_sync_single_for_device(..)
      
      dma_sync_single_for_device(dev, tlb_addr + offset, size, DMA_TO_DEVICE);
      
  Error case.
    Data is copied to the original buffer, but from the base address
  (instead of base address + offset) of the original buffer:
      
       swiotlb_align_offset
           |<----->|<- offset ->|<- size ->|
           +-------+-----------------------------------+
           |       |            |##########|           | swiotlb buffer
           +-------+-----------------------------------+
                tlb_addr
      
          |<- size ->|
          +-----------------------------------+
          |##########|                        | original buffer
          +-----------------------------------+
        vaddr
      
      The fix is to copy the data to the original buffer and take into
      account the offset, like so:
      
       swiotlb_align_offset
           |<----->|<- offset ->|<- size ->|
           +-------+-----------------------------------+
           |       |            |##########|           | swiotlb buffer
           +-------+-----------------------------------+
                tlb_addr
      
          |<- offset ->|<- size ->|
          +-----------------------------------+
          |            |##########|           | original buffer
          +-----------------------------------+
        vaddr
      
[One fix, which was Linus's and made more sense as it created a
symmetry, would break NVMe. The reason for that is that:
 unsigned int offset = (tlb_addr - mem->start) & (IO_TLB_SIZE - 1);

would come up with the proper offset, but it would lose the
alignment (which this patch retains).]
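
A hedged sketch of the idea in the sync path (an illustrative fragment
only, with direction handling omitted; not the exact mainline code):

    unsigned int tlb_offset;

    /* How far into its bounce slot does the caller's tlb_addr point,
     * once the alignment padding added at map time is discounted? */
    tlb_offset = (tlb_addr & (IO_TLB_SIZE - 1)) -
                 swiotlb_align_offset(dev, orig_addr);

    /* Copy to/from the matching part of the original buffer, not its base. */
    memcpy(phys_to_virt(orig_addr + tlb_offset),
           phys_to_virt(tlb_addr), size);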
      
Fixes: 16fc3cef ("swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single")
Signed-off-by: Bumyong Lee <bumyong.lee@samsung.com>
Signed-off-by: Chanho Park <chanho61.park@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
Reported-by: Horia Geantă <horia.geanta@nxp.com>
Tested-by: Horia Geantă <horia.geanta@nxp.com>
      CC: stable@vger.kernel.org
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e6108147
      KVM: SVM: Call SEV Guest Decommission if ASID binding fails · 7570a8b5
      Alper Gun authored
      commit 934002cd upstream.
      
Send the SEV_CMD_DECOMMISSION command to PSP firmware if ASID binding
fails. If a failure happens after a successful LAUNCH_START command,
a decommission command should be executed. Otherwise, the guest context
is never freed inside the AMD SP. Once the firmware no longer has
memory to allocate more SEV guest contexts, the LAUNCH_START command
will begin to fail with the SEV_RET_RESOURCE_LIMIT error.
      
      The existing code calls decommission inside sev_unbind_asid, but it is
      not called if a failure happens before guest activation succeeds. If
      sev_bind_asid fails, decommission is never called. PSP firmware has a
      limit for the number of guests. If sev_asid_binding fails many times,
      PSP firmware will not have resources to create another guest context.
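
A hedged sketch of the shape of the fix (the error label and placement
are assumptions, not the exact patch):

    static void sev_decommission(unsigned int handle)
    {
            struct sev_data_decommission decommission;

            if (!handle)
                    return;

            decommission.handle = handle;
            sev_guest_decommission(&decommission, NULL);
    }

    /* ...in sev_launch_start(), after LAUNCH_START has succeeded: */
    ret = sev_bind_asid(kvm, start->handle, error);
    if (ret) {
            sev_decommission(start->handle);   /* free the context in the SP */
            goto e_free_session;
    }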
      
      Cc: stable@vger.kernel.org
Fixes: 59414c98 ("KVM: SVM: Add support for KVM_SEV_LAUNCH_START command")
Reported-by: Peter Gonda <pgonda@google.com>
Signed-off-by: Alper Gun <alpergun@google.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210610174604.2554090-1-alpergun@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7570a8b5
      mm, futex: fix shared futex pgoff on shmem huge page · 377a796e
      Hugh Dickins authored
      commit fe19bd3d upstream.
      
      If more than one futex is placed on a shmem huge page, it can happen
      that waking the second wakes the first instead, and leaves the second
      waiting: the key's shared.pgoff is wrong.
      
When 3.11 commit 13d60f4b ("futex: Take hugepages into account when
generating futex_key") was added, the only shared huge pages came from
hugetlbfs, and the code added to deal with its exceptional page->index
was put into hugetlb source.
      hugetlb source.  Then that was missed when 4.8 added shmem huge pages.
      
      page_to_pgoff() is what others use for this nowadays: except that, as
      currently written, it gives the right answer on hugetlbfs head, but
      nonsense on hugetlbfs tails.  Fix that by calling hugetlbfs-specific
      hugetlb_basepage_index() on PageHuge tails as well as on head.
      
      Yes, it's unconventional to declare hugetlb_basepage_index() there in
      pagemap.h, rather than in hugetlb.h; but I do not expect anything but
      page_to_pgoff() ever to need it.
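
A hedged sketch of the corrected helper (approximate, not necessarily
the exact patch):

    static inline pgoff_t page_to_pgoff(struct page *page)
    {
            /* hugetlbfs needs its own index calculation on head and tails */
            if (unlikely(PageHuge(page)))
                    return hugetlb_basepage_index(page);

            /* everyone else: usual compound-page arithmetic */
            return page_to_index(page);
    }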
      
      [akpm@linux-foundation.org: give hugetlb_basepage_index() prototype the correct scope]
      
      Link: https://lkml.kernel.org/r/b17d946b-d09-326e-b42a-52884c36df32@google.com
Fixes: 800d8c63 ("shmem: add huge pages support")
Reported-by: Neel Natu <neelnatu@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Zhang Yi <wetpzy@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Darren Hart <dvhart@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      377a796e
      mm/thp: another PVMW_SYNC fix in page_vma_mapped_walk() · ab9d1781
      Hugh Dickins authored
      commit a7a69d8b upstream.
      
      Aha! Shouldn't that quick scan over pte_none()s make sure that it holds
      ptlock in the PVMW_SYNC case? That too might have been responsible for
      BUGs or WARNs in split_huge_page_to_list() or its unmap_page(), though
      I've never seen any.
      
      Link: https://lkml.kernel.org/r/1bdf384c-8137-a149-2a1e-475a4791c3c@google.com
      Link: https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
Fixes: ace71a19 ("mm: introduce page_vma_mapped_walk()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Tested-by: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ab9d1781
      mm/thp: fix page_vma_mapped_walk() if THP mapped by ptes · 915c3a26
      Hugh Dickins authored
      commit a9a7504d upstream.
      
      Running certain tests with a DEBUG_VM kernel would crash within hours,
      on the total_mapcount BUG() in split_huge_page_to_list(), while trying
      to free up some memory by punching a hole in a shmem huge page: split's
      try_to_unmap() was unable to find all the mappings of the page (which,
      on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).
      
      Crash dumps showed two tail pages of a shmem huge page remained mapped
      by pte: ptes in a non-huge-aligned vma of a gVisor process, at the end
      of a long unmapped range; and no page table had yet been allocated for
      the head of the huge page to be mapped into.
      
      Although designed to handle these odd misaligned huge-page-mapped-by-pte
      cases, page_vma_mapped_walk() falls short by returning false prematurely
      when !pmd_present or !pud_present or !p4d_present or !pgd_present: there
      are cases when a huge page may span the boundary, with ptes present in
      the next.
      
      Restructure page_vma_mapped_walk() as a loop to continue in these cases,
      while keeping its layout much as before.  Add a step_forward() helper to
      advance pvmw->address across those boundaries: originally I tried to use
      mm's standard p?d_addr_end() macros, but hit the same crash 512 times
      less often: because of the way redundant levels are folded together, but
      folded differently in different configurations, it was just too
      difficult to use them correctly; and step_forward() is simpler anyway.
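
A hedged sketch of such a helper (approximate):

    static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
    {
            /* Round the address up to the next boundary of the given size,
             * saturating rather than wrapping to zero at the top of the
             * address space. */
            pvmw->address = (pvmw->address + size) & ~(size - 1);
            if (!pvmw->address)
                    pvmw->address = ULONG_MAX;
    }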
      
      Link: https://lkml.kernel.org/r/fedb8632-1798-de42-f39e-873551d5bc81@google.com
Fixes: ace71a19 ("mm: introduce page_vma_mapped_walk()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      915c3a26
      mm: page_vma_mapped_walk(): get vma_address_end() earlier · 90073aec
      Hugh Dickins authored
commit a765c417 upstream.
      
      page_vma_mapped_walk() cleanup: get THP's vma_address_end() at the
      start, rather than later at next_pte.
      
      It's a little unnecessary overhead on the first call, but makes for a
      simpler loop in the following commit.
      
      Link: https://lkml.kernel.org/r/4542b34d-862f-7cb4-bb22-e0df6ce830a2@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      90073aec
      mm: page_vma_mapped_walk(): use goto instead of while (1) · bf60fc23
      Hugh Dickins authored
commit 47446630 upstream.
      
      page_vma_mapped_walk() cleanup: add a label this_pte, matching next_pte,
      and use "goto this_pte", in place of the "while (1)" loop at the end.
      
      Link: https://lkml.kernel.org/r/a52b234a-851-3616-2525-f42736e8934@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bf60fc23
      mm: page_vma_mapped_walk(): add a level of indentation · 9f85dcaf
      Hugh Dickins authored
commit b3807a91 upstream.
      
      page_vma_mapped_walk() cleanup: add a level of indentation to much of
      the body, making no functional change in this commit, but reducing the
      later diff when this is all converted to a loop.
      
[hughd@google.com: page_vma_mapped_walk(): add a level of indentation fix]
        Link: https://lkml.kernel.org/r/7f817555-3ce1-c785-e438-87d8efdcaf26@google.com
      
      Link: https://lkml.kernel.org/r/efde211-f3e2-fe54-977-ef481419e7f3@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9f85dcaf
      mm: page_vma_mapped_walk(): crossing page table boundary · e56bdb39
      Hugh Dickins authored
commit 44828248 upstream.
      
      page_vma_mapped_walk() cleanup: adjust the test for crossing page table
      boundary - I believe pvmw->address is always page-aligned, but nothing
      else here assumed that; and remember to reset pvmw->pte to NULL after
      unmapping the page table, though I never saw any bug from that.
      
      Link: https://lkml.kernel.org/r/799b3f9c-2a9e-dfef-5d89-26e9f76fd97@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e56bdb39
      mm: page_vma_mapped_walk(): prettify PVMW_MIGRATION block · 8dc191ed
      Hugh Dickins authored
commit e2e1d407 upstream.
      
      page_vma_mapped_walk() cleanup: rearrange the !pmd_present() block to
      follow the same "return not_found, return not_found, return true"
      pattern as the block above it (note: returning not_found there is never
      premature, since existence or prior existence of huge pmd guarantees
      good alignment).
      
      Link: https://lkml.kernel.org/r/378c8650-1488-2edf-9647-32a53cf2e21@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8dc191ed
      mm: page_vma_mapped_walk(): use pmde for *pvmw->pmd · 7b55a4bc
      Hugh Dickins authored
commit 3306d311 upstream.
      
      page_vma_mapped_walk() cleanup: re-evaluate pmde after taking lock, then
      use it in subsequent tests, instead of repeatedly dereferencing pointer.
      
      Link: https://lkml.kernel.org/r/53fbc9d-891e-46b2-cb4b-468c3b19238e@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7b55a4bc
      mm: page_vma_mapped_walk(): settle PageHuge on entry · 1cb0b905
      Hugh Dickins authored
commit 6d0fd598 upstream.
      
      page_vma_mapped_walk() cleanup: get the hugetlbfs PageHuge case out of
      the way at the start, so no need to worry about it later.
      
      Link: https://lkml.kernel.org/r/e31a483c-6d73-a6bb-26c5-43c3b880a2@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      1cb0b905
      mm: page_vma_mapped_walk(): use page for pvmw->page · 65febb41
      Hugh Dickins authored
commit f003c03b upstream.
      
      Patch series "mm: page_vma_mapped_walk() cleanup and THP fixes".
      
      I've marked all of these for stable: many are merely cleanups, but I
      think they are much better before the main fix than after.
      
      This patch (of 11):
      
      page_vma_mapped_walk() cleanup: sometimes the local copy of pvwm->page
      was used, sometimes pvmw->page itself: use the local copy "page"
      throughout.
      
      Link: https://lkml.kernel.org/r/589b358c-febc-c88e-d4c2-7834b37fa7bf@google.com
      Link: https://lkml.kernel.org/r/88e67645-f467-c279-bf5e-af4b5c6b13eb@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      65febb41
      mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split · 825c2805
      Yang Shi authored
[ Upstream commit 504e070d ]
      
When debugging the bug reported by Wang Yugui [1], try_to_unmap() may
fail, but the first VM_BUG_ON_PAGE() just checks page_mapcount(); however,
it may miss the failure when the head page is unmapped but another subpage
is mapped.  Then the second DEBUG_VM BUG(), which checks total mapcount,
would catch it.  This may cause some confusion.

As this is not a fatal issue, consolidate the two DEBUG_VM checks
      into one VM_WARN_ON_ONCE_PAGE().
      
      [1] https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
      
      Link: https://lkml.kernel.org/r/d0f0db68-98b8-ebfb-16dc-f29df24cf012@google.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      Note on stable backport: fixed up variables in split_huge_page_to_list().
      
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      825c2805
      mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page() · 0010275c
      Hugh Dickins authored
      [ Upstream commit 22061a1f ]
      
      There is a race between THP unmapping and truncation, when truncate sees
      pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared
      it, but before its page_remove_rmap() gets to decrement
      compound_mapcount: generating false "BUG: Bad page cache" reports that
      the page is still mapped when deleted.  This commit fixes that, but not
      in the way I hoped.
      
      The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
      instead of unmap_mapping_range() in truncate_cleanup_page(): it has
      often been an annoyance that we usually call unmap_mapping_range() with
no pages locked, but here apply it to a single locked page.
      try_to_unmap() looks more suitable for a single locked page.
      
      However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
      it is used to insert THP migration entries, but not used to unmap THPs.
      Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB
      needs are different, I'm too ignorant of the DAX cases, and couldn't
      decide how far to go for anon+swap.  Set that aside.
      
      The second attempt took a different tack: make no change in truncate.c,
      but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
      clearing it initially, then pmd_clear() between page_remove_rmap() and
      unlocking at the end.  Nice.  But powerpc blows that approach out of the
      water, with its serialize_against_pte_lookup(), and interesting pgtable
      usage.  It would need serious help to get working on powerpc (with a
      minor optimization issue on s390 too).  Set that aside.
      
      Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
      delay, after unmapping in truncate_cleanup_page()? Perhaps, but though
      that's likely to reduce or eliminate the number of incidents, it would
      give less assurance of whether we had identified the problem correctly.
      
      This successful iteration introduces "unmap_mapping_page(page)" instead
      of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
      with an addition to details.  Then zap_pmd_range() watches for this
      case, and does spin_unlock(pmd_lock) if so - just like
      page_vma_mapped_walk() now does in the PVMW_SYNC case.  Not pretty, but
      safe.
      
      Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
      assert its interface; but currently that's only used to make sure that
      page->mapping is stable, and zap_pmd_range() doesn't care if the page is
      locked or not.  Along these lines, in invalidate_inode_pages2_range()
      move the initial unmap_mapping_range() out from under page lock, before
      then calling unmap_mapping_page() under page lock if still mapped.
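
A hedged sketch of the new helper (field and function names approximate,
not the exact patch):

    void unmap_mapping_page(struct page *page)
    {
            struct address_space *mapping = page->mapping;
            struct zap_details details = { };

            VM_BUG_ON(!PageLocked(page));
            VM_BUG_ON(PageTail(page));

            details.check_mapping = mapping;
            details.first_index = page->index;
            details.last_index = page->index + thp_nr_pages(page) - 1;
            details.single_page = page;    /* lets zap_pmd_range() serialize */

            i_mmap_lock_write(mapping);
            if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
                    unmap_mapping_range_tree(&mapping->i_mmap, &details);
            i_mmap_unlock_write(mapping);
    }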
      
      Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
Fixes: fc127da0 ("truncate: handle file thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      Note on stable backport: fixed up call to truncate_cleanup_page()
      in truncate_inode_pages_range().
      
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0010275c
      mm/thp: fix page_address_in_vma() on file THP tails · 38cda6b5
      Jue Wang authored
      commit 31657170 upstream.
      
      Anon THP tails were already supported, but memory-failure may need to
      use page_address_in_vma() on file THP tails, which its page->mapping
      check did not permit: fix it.
      
      hughd adds: no current usage is known to hit the issue, but this does
      fix a subtle trap in a general helper: best fixed in stable sooner than
      later.
      
      Link: https://lkml.kernel.org/r/a0d9b53-bf5d-8bab-ac5-759dc61819c1@google.com
Fixes: 800d8c63 ("shmem: add huge pages support")
Signed-off-by: Jue Wang <juew@google.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      38cda6b5
      mm/thp: fix vma_address() if virtual address below file offset · 37ffe9f4
      Hugh Dickins authored
      commit 494334e4 upstream.
      
      Running certain tests with a DEBUG_VM kernel would crash within hours,
      on the total_mapcount BUG() in split_huge_page_to_list(), while trying
      to free up some memory by punching a hole in a shmem huge page: split's
      try_to_unmap() was unable to find all the mappings of the page (which,
      on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).
      
      When that BUG() was changed to a WARN(), it would later crash on the
      VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma) in
      mm/internal.h:vma_address(), used by rmap_walk_file() for
      try_to_unmap().
      
      vma_address() is usually correct, but there's a wraparound case when the
      vm_start address is unusually low, but vm_pgoff not so low:
      vma_address() chooses max(start, vma->vm_start), but that decides on the
      wrong address, because start has become almost ULONG_MAX.
      
      Rewrite vma_address() to be more careful about vm_pgoff; move the
      VM_BUG_ON_VMA() out of it, returning -EFAULT for errors, so that it can
      be safely used from page_mapped_in_vma() and page_address_in_vma() too.
      
      Add vma_address_end() to apply similar care to end address calculation,
      in page_vma_mapped_walk() and page_mkclean_one() and try_to_unmap_one();
      though it raises a question of whether callers would do better to supply
      pvmw->end to page_vma_mapped_walk() - I chose not, for a smaller patch.
      
      An irritation is that their apparent generality breaks down on KSM
      pages, which cannot be located by the page->index that page_to_pgoff()
      uses: as commit 4b0ece6f ("mm: migrate: fix remove_migration_pte()
      for ksm pages") once discovered.  I dithered over the best thing to do
      about that, and have ended up with a VM_BUG_ON_PAGE(PageKsm) in both
      vma_address() and vma_address_end(); though the only place in danger of
      using it on them was try_to_unmap_one().
      
      Sidenote: vma_address() and vma_address_end() now use compound_nr() on a
      head page, instead of thp_size(): to make the right calculation on a
      hugetlbfs page, whether or not THPs are configured.  try_to_unmap() is
      used on hugetlbfs pages, but perhaps the wrong calculation never
      mattered.
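
A hedged sketch of the rewritten helper (condensed and approximate):

    static inline unsigned long
    vma_address(struct page *page, struct vm_area_struct *vma)
    {
            /* KSM pages cannot be located by page->index; callers exclude them */
            pgoff_t pgoff = page_to_pgoff(page);
            unsigned long address;

            if (pgoff >= vma->vm_pgoff) {
                    address = vma->vm_start +
                            ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
                    /* page is outside the vma, or address wrapped */
                    if (address < vma->vm_start || address >= vma->vm_end)
                            address = -EFAULT;
            } else if (PageHead(page) &&
                       pgoff + compound_nr(page) - 1 >= vma->vm_pgoff) {
                    /* the compound page straddles the start of the vma */
                    address = vma->vm_start;
            } else {
                    address = -EFAULT;
            }
            return address;
    }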
      
      Link: https://lkml.kernel.org/r/caf1c1a3-7cfb-7f8f-1beb-ba816e932825@google.com
Fixes: a8fa41ad ("mm, rmap: check all VMAs that PTE-mapped THP can be part of")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      37ffe9f4
      mm/thp: try_to_unmap() use TTU_SYNC for safe splitting · 66be14a9
      Hugh Dickins authored
      commit 732ed558 upstream.
      
      Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
      (!unmap_success): with dump_page() showing mapcount:1, but then its raw
      struct page output showing _mapcount ffffffff i.e.  mapcount 0.
      
      And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed,
      it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)),
      and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG():
      all indicative of some mapcount difficulty in development here perhaps.
      But the !CONFIG_DEBUG_VM path handles the failures correctly and
      silently.
      
      I believe the problem is that once a racing unmap has cleared pte or
      pmd, try_to_unmap_one() may skip taking the page table lock, and emerge
      from try_to_unmap() before the racing task has reached decrementing
      mapcount.
      
      Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
      follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding
      TTU_SYNC to the options, and passing that from unmap_page().
      
      When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same
      for both: the slight overhead added should rarely matter, except perhaps
      if splitting sparsely-populated multiply-mapped shmem.  Once confident
      that bugs are fixed, TTU_SYNC here can be removed, and the race
      tolerated.
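
A hedged sketch of how the flag might be wired through (fragments,
approximate):

    /* in unmap_page(): request a synchronous rmap walk */
    enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_SYNC |
                               TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;

    /* in try_to_unmap_one(): TTU_SYNC maps onto PVMW_SYNC */
    struct page_vma_mapped_walk pvmw = {
            .page = page,
            .vma = vma,
            .address = address,
    };

    if (flags & TTU_SYNC)
            pvmw.flags = PVMW_SYNC;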
      
      Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com
Fixes: fec89c10 ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers")
Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      66be14a9
      mm/thp: make is_huge_zero_pmd() safe and quicker · 6527d8ef
      Hugh Dickins authored
      commit 3b77e8c8 upstream.
      
      Most callers of is_huge_zero_pmd() supply a pmd already verified
      present; but a few (notably zap_huge_pmd()) do not - it might be a pmd
      migration entry, in which the pfn is encoded differently from a present
      pmd: which might pass the is_huge_zero_pmd() test (though not on x86,
      since L1TF forced us to protect against that); or perhaps even crash in
      pmd_page() applied to a swap-like entry.
      
      Make it safe by adding pmd_present() check into is_huge_zero_pmd()
      itself; and make it quicker by saving huge_zero_pfn, so that
      is_huge_zero_pmd() will not need to do that pmd_page() lookup each time.
      
      __split_huge_pmd_locked() checked pmd_trans_huge() before: that worked,
      but is unnecessary now that is_huge_zero_pmd() checks present.
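
A hedged sketch of the resulting helper (approximate), with huge_zero_pfn
assumed to be cached when the huge zero page is allocated:

    static inline bool is_huge_zero_pmd(pmd_t pmd)
    {
            /* compare against the cached pfn, and never trust a non-present pmd */
            return READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd) && pmd_present(pmd);
    }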
      
      Link: https://lkml.kernel.org/r/21ea9ca-a1f5-8b90-5e88-95fb1c49bbfa@google.com
Fixes: e71769ae ("mm: enable thp migration for shmem thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6527d8ef
      mm/thp: fix __split_huge_pmd_locked() on shmem migration entry · a8f4ea1d
      Hugh Dickins authored
      [ Upstream commit 99fa8a48 ]
      
      Patch series "mm/thp: fix THP splitting unmap BUGs and related", v10.
      
      Here is v2 batch of long-standing THP bug fixes that I had not got
      around to sending before, but prompted now by Wang Yugui's report
      https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
      
      Wang Yugui has tested a rollup of these fixes applied to 5.10.39, and
      they have done no harm, but have *not* fixed that issue: something more
      is needed and I have no idea of what.
      
      This patch (of 7):
      
      Stressing huge tmpfs page migration racing hole punch often crashed on
      the VM_BUG_ON(!pmd_present) in pmdp_huge_clear_flush(), with DEBUG_VM=y
      kernel; or shortly afterwards, on a bad dereference in
      __split_huge_pmd_locked() when DEBUG_VM=n.  They forgot to allow for pmd
      migration entries in the non-anonymous case.
      
      Full disclosure: those particular experiments were on a kernel with more
      relaxed mmap_lock and i_mmap_rwsem locking, and were not repeated on the
      vanilla kernel: it is conceivable that stricter locking happens to avoid
      those cases, or makes them less likely; but __split_huge_pmd_locked()
      already allowed for pmd migration entries when handling anonymous THPs,
      so this commit brings the shmem and file THP handling into line.
      
      And while there: use old_pmd rather than _pmd, as in the following
      blocks; and make it clearer to the eye that the !vma_is_anonymous()
      block is self-contained, making an early return after accounting for
      unmapping.
      
      Link: https://lkml.kernel.org/r/af88612-1473-2eaa-903-8d1a448b26@google.com
      Link: https://lkml.kernel.org/r/dd221a99-efb3-cd1d-6256-7e646af29314@google.com
Fixes: e71769ae ("mm: enable thp migration for shmem thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jue Wang <juew@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      Note on stable backport: this commit made intervening cleanups in
      pmdp_huge_clear_flush() redundant: here it's rediffed to skip them.
      
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      a8f4ea1d
      mm, thp: use head page in __migration_entry_wait() · 32f954e9
      Xu Yu authored
      commit ffc90cbb upstream.
      
We noticed that a hung task can happen in a corner-case but practical
scenario when CONFIG_PREEMPT_NONE is enabled, as follows.
      
      Process 0                       Process 1                     Process 2..Inf
      split_huge_page_to_list
          unmap_page
              split_huge_pmd_address
                                      __migration_entry_wait(head)
                                                                    __migration_entry_wait(tail)
          remap_page (roll back)
              remove_migration_ptes
                  rmap_walk_anon
                      cond_resched
      
      Here __migration_entry_wait(tail) occurs in kernel space, e.g., via
      copy_to_user in fstat, which immediately faults again without
      rescheduling and thus fully occupies the CPU.
      
      When too many processes perform __migration_entry_wait on the tail
      page, remap_page will never get to run after cond_resched.
      
      Make __migration_entry_wait operate on the compound head page, so that
      it waits for remap_page to complete whether the THP split succeeds or
      is rolled back.
      
      Note that put_and_wait_on_page_locked helps to drop the page reference
      acquired with get_page_unless_zero, as soon as the page is on the wait
      queue, before actually waiting.  So splitting the THP is only prevented
      for a brief interval.
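
      For illustration, an abridged fragment of __migration_entry_wait() in
      mm/migrate.c as it looks after this fix (locking setup and local
      declarations elided); the functional change is the single
      compound_head() call:

      page = migration_entry_to_page(entry);
      page = compound_head(page);     /* always wait on the THP head page */

      /*
       * put_and_wait_on_page_locked() drops the reference taken by
       * get_page_unless_zero() as soon as the task is on the wait queue,
       * so the split itself is only briefly delayed.
       */
      if (!get_page_unless_zero(page))
              goto out;
      pte_unmap_unlock(ptep, ptl);
      put_and_wait_on_page_locked(page);
      return;
      out:
      pte_unmap_unlock(ptep, ptl);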
      
      Link: https://lkml.kernel.org/r/b9836c1dd522e903891760af9f0c86a2cce987eb.1623144009.git.xuyu@linux.alibaba.com
      Fixes: ba988280 ("thp: add option to setup migration entries during PMD split")
      Suggested-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarGang Deng <gavin.dg@linux.alibaba.com>
      Signed-off-by: default avatarXu Yu <xuyu@linux.alibaba.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      32f954e9
    • Miaohe Lin's avatar
      mm/rmap: use page_not_mapped in try_to_unmap() · bfd90b56
      Miaohe Lin authored
      [ Upstream commit b7e188ec ]
      
      page_mapcount_is_zero() calculates exactly how many mappings a hugepage
      has, only to compare the result against 0.  This is a waste of CPU
      time: page_not_mapped() answers the same question and saves some
      possible atomic_read cycles.  Remove page_mapcount_is_zero() as it is
      no longer used, and move page_not_mapped() above try_to_unmap() to
      avoid an undeclared-identifier compilation error.
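
      A simplified sketch of the helpers in mm/rmap.c, roughly before and
      after: the rmap_walk "done" callback only needs a yes/no answer, so
      the cheaper page_mapped() test is enough.

      /* Before: counts every mapping of the compound page, only to compare
       * the total against zero. */
      static int page_mapcount_is_zero(struct page *page)
      {
              return !total_mapcount(page);
      }

      /* After: answers the same question without the exact count. */
      static int page_not_mapped(struct page *page)
      {
              return !page_mapped(page);
      }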
      
      Link: https://lkml.kernel.org/r/20210130084904.35307-1-linmiaohe@huawei.com
      Signed-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bfd90b56
    • Miaohe Lin's avatar
      mm/rmap: remove unneeded semicolon in page_not_mapped() · ff81af82
      Miaohe Lin authored
      [ Upstream commit e0af87ff ]
      
      Remove extra semicolon without any functional change intended.
      
      Link: https://lkml.kernel.org/r/20210127093425.39640-1-linmiaohe@huawei.com
      Signed-off-by: default avatarMiaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ff81af82
    • Alex Shi's avatar
      mm: add VM_WARN_ON_ONCE_PAGE() macro · a0ad7ea0
      Alex Shi authored
      [ Upstream commit a4055888, in part ]
      
      Add VM_WARN_ON_ONCE_PAGE() macro.
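
      For reference, a sketch of the macro roughly as defined in
      include/linux/mmdebug.h (minor formatting may differ): warn only once,
      and dump the offending page the first time the condition triggers.

      #define VM_WARN_ON_ONCE_PAGE(cond, page)        ({                   \
              static bool __section(".data.once") __warned;                \
              int __ret_warn_once = !!(cond);                              \
                                                                           \
              if (unlikely(__ret_warn_once && !__warned)) {                \
                      dump_page(page, "VM_WARN_ON_ONCE_PAGE("              \
                                      __stringify(cond) ")");              \
                      __warned = true;                                     \
                      WARN_ON(1);                                          \
              }                                                            \
              unlikely(__ret_warn_once);                                   \
      })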
      
      Link: https://lkml.kernel.org/r/1604283436-18880-3-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: default avatarAlex Shi <alex.shi@linux.alibaba.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarHugh Dickins <hughd@google.com>
      Acked-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      
      Note on stable backport: original commit was titled
      mm/memcg: warning on !memcg after readahead page charged
      which included uses of this macro in mm/memcontrol.c: here omitted.
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a0ad7ea0
    • Thomas Gleixner's avatar
      x86/fpu: Make init_fpstate correct with optimized XSAVE · 130a1d76
      Thomas Gleixner authored
      commit f9dfb5e3 upstream.
      
      The XSAVE init code initializes all enabled and supported components with
      XRSTOR(S) to init state. Then it XSAVEs the state of the components back
      into init_fpstate which is used in several places to fill in the init state
      of components.
      
      This works correctly with XSAVE, but not with XSAVEOPT and XSAVES because
      those use the init optimization and skip writing state of components which
      are in init state. So init_fpstate.xsave still contains all zeroes after
      this operation.
      
      There are two ways to solve that:
      
         1) Use XSAVE unconditionally, but that requires reshuffling the buffer
            when XSAVES is enabled because XSAVES uses the compacted format.
      
         2) Save the components which are known to have a non-zero init state by other
            means.
      
      Looking deeper, #2 is the right thing to do because all components the
      kernel supports have all-zeroes init state except the legacy features (FP,
      SSE). Those cannot be hard coded because the states are not identical on all
      CPUs, but they can be saved with FXSAVE which avoids all conditionals.
      
      Use FXSAVE to save the legacy FP/SSE components in init_fpstate along with
      a BUILD_BUG_ON() which reminds developers to validate that a newly added
      component has all zeroes init state. As a bonus remove the now unused
      copy_xregs_to_kernel_booting() crutch.
      
      The XSAVE and reshuffle method can still be implemented in the unlikely
      case that components are added which have a non-zero init state and no
      other means to save them. For now, FXSAVE is just simple and good enough.
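
      For illustration, a simplified sketch of the approach (abridged from
      the changes under arch/x86/kernel/fpu/; the exact guards and mask
      names may differ in detail):

      /* Plain FXSAVE, unaffected by the XSAVE init optimization. */
      static inline void fxsave(struct fxregs_state *fx)
      {
              if (IS_ENABLED(CONFIG_X86_32))
                      asm volatile("fxsave %[fx]" : [fx] "=m" (*fx));
              else
                      asm volatile("fxsaveq %[fx]" : [fx] "=m" (*fx));
      }

      static void __init setup_init_fpu_buf(void)
      {
              /* Remind developers: any newly added component must have an
               * all-zeroes init state, or needs explicit handling here. */
              BUILD_BUG_ON((XFEATURE_MASK_USER_SUPPORTED |
                            XFEATURE_MASK_SUPERVISOR_SUPPORTED) !=
                           XFEATURES_INIT_FPSTATE_HANDLED);

              /* FP and SSE are the only components with a non-zero init
               * state; FXSAVE stores them into the legacy area. */
              fxsave(&init_fpstate.fxsave);
      }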
      
        [ bp: Fix a typo or two in the text. ]
      
      Fixes: 6bad06b7 ("x86, xsave: Use xsaveopt in context-switch path when supported")
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarBorislav Petkov <bp@suse.de>
      Reviewed-by: default avatarBorislav Petkov <bp@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20210618143444.587311343@linutronix.de
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      130a1d76
    • Thomas Gleixner's avatar
      x86/fpu: Preserve supervisor states in sanitize_restored_user_xstate() · 51d80117
      Thomas Gleixner authored
      commit 9301982c upstream.
      
      sanitize_restored_user_xstate() preserves the supervisor states only
      when the fx_only argument is zero, which allows unprivileged user space
      to put supervisor states back into init state.
      
      Preserve them unconditionally.
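
      A simplified sketch of the corrected logic in
      sanitize_restored_user_xstate() (the upstream code carries more
      commentary): whatever mask is derived from the user sigframe,
      supervisor bits are OR'ed back in, so they can never be cleared from
      header->xfeatures.

      if (use_xsave()) {
              /* fx_only restores only the legacy FP/SSE area. */
              u64 mask = fx_only ? XFEATURE_MASK_FPSSE : user_xfeatures;

              /* Supervisor states never come from the sigframe and must
               * survive unconditionally. */
              header->xfeatures &= mask | xfeatures_mask_supervisor();
      }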
      
       [ bp: Fix a typo or two in the text. ]
      
      Fixes: 5d6b6a6f ("x86/fpu/xstate: Update sanitize_restored_xstate() for supervisor xstates")
      Signed-off-by: default avatarThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarBorislav Petkov <bp@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20210618143444.438635017@linutronix.de
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      51d80117
    • Petr Mladek's avatar
      kthread: prevent deadlock when kthread_mod_delayed_work() races with kthread_cancel_delayed_work_sync() · 2b35a4ea
      Petr Mladek authored
      
      commit 5fa54346 upstream.
      
      The system might hang with the following backtrace:
      
      	schedule+0x80/0x100
      	schedule_timeout+0x48/0x138
      	wait_for_common+0xa4/0x134
      	wait_for_completion+0x1c/0x2c
      	kthread_flush_work+0x114/0x1cc
      	kthread_cancel_work_sync.llvm.16514401384283632983+0xe8/0x144
      	kthread_cancel_delayed_work_sync+0x18/0x2c
      	xxxx_pm_notify+0xb0/0xd8
      	blocking_notifier_call_chain_robust+0x80/0x194
      	pm_notifier_call_chain_robust+0x28/0x4c
      	suspend_prepare+0x40/0x260
      	enter_state+0x80/0x3f4
      	pm_suspend+0x60/0xdc
      	state_store+0x108/0x144
      	kobj_attr_store+0x38/0x88
      	sysfs_kf_write+0x64/0xc0
      	kernfs_fop_write_iter+0x108/0x1d0
      	vfs_write+0x2f4/0x368
      	ksys_write+0x7c/0xec
      
      It is caused by the following race between kthread_mod_delayed_work()
      and kthread_cancel_delayed_work_sync():
      
      CPU0				CPU1
      
      Context: Thread A		Context: Thread B
      
      kthread_mod_delayed_work()
        spin_lock()
        __kthread_cancel_work()
           spin_unlock()
           del_timer_sync()
      				kthread_cancel_delayed_work_sync()
      				  spin_lock()
      				  __kthread_cancel_work()
      				    spin_unlock()
      				    del_timer_sync()
      				    spin_lock()
      
      				  work->canceling++
      				  spin_unlock
           spin_lock()
         queue_delayed_work()
           // dwork is put into the worker->delayed_work_list
      
         spin_unlock()
      
      				  kthread_flush_work()
           // flush_work is put at the tail of the dwork
      
      				    wait_for_completion()
      
      Context: IRQ
      
        kthread_delayed_work_timer_fn()
          spin_lock()
          list_del_init(&work->node);
          spin_unlock()
      
      BANG: flush_work is no longer linked and will never proceed.
      
      The problem is that kthread_mod_delayed_work() checks the
      work->canceling flag before canceling the timer.
      
      A simple solution is to (re)check work->canceling after
      __kthread_cancel_work().  But then it is not clear what should be
      returned when __kthread_cancel_work() removed the work from the queue
      (list) and it can't queue it again with the new @delay.
      
      The return value might be used for reference counting.  The caller has
      to know whether a new work has been queued or an existing one was
      replaced.
      
      The proper solution is that kthread_mod_delayed_work() will remove the
      work from the queue (list) _only_ when work->canceling is not set.  The
      flag must be checked after the timer is stopped and the remaining
      operations can be done under worker->lock.
      
      Note that kthread_mod_delayed_work() could remove the timer and then
      bail out.  It is fine.  The other canceling caller needs to cancel the
      timer as well.  The important thing is that the queue (list)
      manipulation is done atomically under worker->lock.
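
      An abridged sketch of the resulting ordering in
      kthread_mod_delayed_work() (return-value handling and the never-queued
      fast path are omitted):

      raw_spin_lock_irqsave(&worker->lock, flags);

      /* Stop the timer first; worker->lock is dropped and re-taken inside,
       * because the expired timer callback takes the same lock. */
      kthread_cancel_delayed_work_timer(work, &flags);

      /* Somebody else is canceling this work: do not touch the queue
       * (list) behind their back, just bail out. */
      if (work->canceling)
              goto out;

      __kthread_cancel_work(work);
      __kthread_queue_delayed_work(worker, dwork, delay);
      out:
      raw_spin_unlock_irqrestore(&worker->lock, flags);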
      
      Link: https://lkml.kernel.org/r/20210610133051.15337-3-pmladek@suse.com
      Fixes: 9a6b06c8 ("kthread: allow to modify delayed kthread work")
      Signed-off-by: default avatarPetr Mladek <pmladek@suse.com>
      Reported-by: default avatarMartin Liu <liumartin@google.com>
      Cc: <jenhaochen@google.com>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2b35a4ea
    • Petr Mladek's avatar
      kthread_worker: split code for canceling the delayed work timer · bfe28af7
      Petr Mladek authored
      commit 34b3d534 upstream.
      
      Patch series "kthread_worker: Fix race between kthread_mod_delayed_work()
      and kthread_cancel_delayed_work_sync()".
      
      This patchset fixes the race between kthread_mod_delayed_work() and
      kthread_cancel_delayed_work_sync() including proper return value
      handling.
      
      This patch (of 2):
      
      Simple code refactoring as a preparation step for fixing a race between
      kthread_mod_delayed_work() and kthread_cancel_delayed_work_sync().
      
      It does not modify the existing behavior.
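
      A sketch of the split-out helper (close to, but not byte-for-byte, the
      upstream code): the timer must be stopped with worker->lock temporarily
      released, because the expired timer callback takes the same lock.

      static void kthread_cancel_delayed_work_timer(struct kthread_work *work,
                                                    unsigned long *flags)
      {
              struct kthread_delayed_work *dwork =
                      container_of(work, struct kthread_delayed_work, work);
              struct kthread_worker *worker = work->worker;

              /* Keep other cancelers away while the lock is dropped
               * around del_timer_sync(). */
              work->canceling++;
              raw_spin_unlock_irqrestore(&worker->lock, *flags);
              del_timer_sync(&dwork->timer);
              raw_spin_lock_irqsave(&worker->lock, *flags);
              work->canceling--;
      }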
      
      Link: https://lkml.kernel.org/r/20210610133051.15337-2-pmladek@suse.com
      Signed-off-by: default avatarPetr Mladek <pmladek@suse.com>
      Cc: <jenhaochen@google.com>
      Cc: Martin Liu <liumartin@google.com>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bfe28af7
    • Jeff Layton's avatar
      ceph: must hold snap_rwsem when filling inode for async create · 02c303f3
      Jeff Layton authored
      commit 27171ae6 upstream.
      
      ...and add a lockdep assertion for it to ceph_fill_inode().
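
      A simplified sketch of the two sides of the fix (argument lists
      abridged; the real ceph_fill_inode() call takes more parameters):

      /* fs/ceph/inode.c: document and enforce the locking requirement */
      lockdep_assert_held(&mdsc->snap_rwsem);

      /* fs/ceph/file.c: the async create path fills the inode under it */
      down_read(&mdsc->snap_rwsem);
      /* ... ceph_fill_inode() on the freshly created inode ... */
      up_read(&mdsc->snap_rwsem);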
      
      Cc: stable@vger.kernel.org # v5.7+
      Fixes: 9a8d03ca ("ceph: attempt to do async create when possible")
      Signed-off-by: default avatarJeff Layton <jlayton@kernel.org>
      Reviewed-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Signed-off-by: default avatarIlya Dryomov <idryomov@gmail.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      02c303f3
    • Johan Hovold's avatar
      i2c: robotfuzz-osif: fix control-request directions · de0af265
      Johan Hovold authored
      commit 4ca070ef upstream.
      
      The direction of the pipe argument must match the request-type direction
      bit or control requests may fail depending on the host-controller-driver
      implementation.
      
      Control transfers without a data stage are treated as OUT requests by
      the USB stack and should be using usb_sndctrlpipe(). Failing to do so
      will now trigger a warning.
      
      Fix the OSIFI2C_SET_BIT_RATE and OSIFI2C_STOP requests which erroneously
      used the osif_usb_read() helper and set the IN direction bit.
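
      For illustration, a sketch of a correctly-directed helper (not the
      driver's exact code): with no data stage the transfer is an OUT
      request, so the pipe and the request-type direction bit must both say
      OUT.

      static int osif_usb_write(struct usb_device *udev, u8 request,
                                u16 value, u16 index)
      {
              /* Zero-length data stage: usb_sndctrlpipe() + USB_DIR_OUT. */
              return usb_control_msg(udev, usb_sndctrlpipe(udev, 0), request,
                                     USB_TYPE_VENDOR | USB_RECIP_INTERFACE |
                                     USB_DIR_OUT,
                                     value, index, NULL, 0, 2000);
      }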
      
      Reported-by: default avatar <syzbot+9d7dadd15b8819d73f41@syzkaller.appspotmail.com>
      Fixes: 83e53a8f ("i2c: Add bus driver for for OSIF USB i2c device.")
      Cc: stable@vger.kernel.org      # 3.14
      Signed-off-by: default avatarJohan Hovold <johan@kernel.org>
      Signed-off-by: default avatarWolfram Sang <wsa@kernel.org>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      de0af265
    • Nicholas Piggin's avatar
      KVM: do not allow mapping valid but non-reference-counted pages · dd8ed6c9
      Nicholas Piggin authored
      commit f8be156b upstream.
      
      It's possible to create a region which maps valid but non-refcounted
      pages (e.g., tail pages of non-compound higher-order allocations).
      These host pages can then be returned by the gfn_to_page, gfn_to_pfn,
      etc. family of APIs, which take a reference to the page and thereby
      raise its refcount from 0 to 1.  When that reference is dropped, the
      page is freed incorrectly.
      
      Fix this by taking a reference on a valid page only if its refcount is
      already non-zero, which indicates that it participates in normal
      refcounting (and can be released with put_page).
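
      A sketch of the helper this introduces in virt/kvm/kvm_main.c (modulo
      details): take the reference only if the refcount is already non-zero;
      otherwise report failure rather than resurrecting a page whose
      refcount is 0.

      static int kvm_try_get_pfn(kvm_pfn_t pfn)
      {
              if (kvm_is_reserved_pfn(pfn))
                      return 1;

              /* Returns 0 when the page's refcount is zero, i.e. the page
               * does not participate in normal refcounting. */
              return get_page_unless_zero(pfn_to_page(pfn));
      }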
      
      This addresses CVE-2021-22543.
      
      Signed-off-by: default avatarNicholas Piggin <npiggin@gmail.com>
      Tested-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dd8ed6c9
    • Heiko Carstens's avatar
      s390/stack: fix possible register corruption with stack switch helper · 5fd0c2cf
      Heiko Carstens authored
      commit 67147e96 upstream.
      
      The CALL_ON_STACK macro is used to call a C function from inline
      assembly, and therefore must consider the C ABI, which says that only
      registers 6-13, and 15 are non-volatile (restored by the called
      function).
      
      The inline assembly incorrectly marks all registers used to pass
      parameters to the called function as read-only input operands, instead
      of operands that are read and written to. This might result in
      register corruption depending on usage, compiler, and compile options.
      
      Fix this by marking all operands used to pass parameters as read/write
      operands.  To keep the code simple, even register 6, if used, is marked
      as a read-write operand.
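
      For illustration, a generic example of the constraint difference
      (x86-64 GCC inline asm, not the s390 CALL_ON_STACK macro itself): an
      operand the assembly may modify must be declared read-write, otherwise
      the compiler may keep using the stale value afterwards.

      static inline long increment(long v)
      {
              /* "+r": v is read *and* written by the asm statement; a plain
               * "r" input would let the compiler assume v is unchanged. */
              asm("addq $1, %0" : "+r" (v));
              return v;
      }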
      
      Fixes: ff340d24 ("s390: add stack switch helper")
      Cc: <stable@kernel.org> # 4.20
      Reviewed-by: default avatarVasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: default avatarHeiko Carstens <hca@linux.ibm.com>
      Signed-off-by: default avatarVasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5fd0c2cf