  1. Aug 19, 2023
    • lib/test_meminit: allocate pages up to order MAX_ORDER · efb78fa8
      Andrew Donnellan authored
      test_pages() tests the page allocator by calling alloc_pages() with
      different orders up to order 10.
      
      However, different architectures and platforms support different maximum
      contiguous allocation sizes.  The default maximum allocation order
      (MAX_ORDER) is 10, but architectures can use CONFIG_ARCH_FORCE_MAX_ORDER
      to override this.  On platforms where this is less than 10, test_meminit()
      will blow up with a WARN().  This is expected, so let's not do that.
      
      Replace the hardcoded "10" with the MAX_ORDER macro so that we test
      allocations up to the expected platform limit.
      
      Link: https://lkml.kernel.org/r/20230714015238.47931-1-ajd@linux.ibm.com
      Fixes: 5015a300 ("lib: introduce test_meminit module")
      Signed-off-by: Andrew Donnellan <ajd@linux.ibm.com>
      Reviewed-by: Alexander Potapenko <glider@google.com>
      Cc: Xiaoke Wang <xkernel.wang@foxmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
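
The effect of replacing the hardcoded limit can be illustrated with a small user-space model. This is only a sketch, not the test_meminit code: SKETCH_MAX_ORDER, sketch_alloc_ok() and sketch_count_failures() are hypothetical names, and the inclusive loop bound is an assumption made for the illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for MAX_ORDER; 10 is the default named in the commit message. */
#define SKETCH_MAX_ORDER 10

/* Hypothetical allocator model: only orders up to the platform limit succeed. */
static int sketch_alloc_ok(int order, int platform_max_order)
{
	assert(order >= 0);
	return order <= platform_max_order;
}

/* Count the allocations that fail when testing orders 0..limit (inclusive). */
static int sketch_count_failures(int limit, int platform_max_order)
{
	int failures = 0;

	for (int order = 0; order <= limit; order++)
		if (!sketch_alloc_ok(order, platform_max_order))
			failures++;
	return failures;
}
```

In this model, a platform whose limit is 8 fails twice when the test loop runs to a fixed 10 and never fails when the loop bound follows the platform limit.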
    • mm/page_ext: move functions around for minor cleanups to page_ext · eb0da7f6
      Kemeng Shi authored
      
      
      1. Move page_ext_get and page_ext_put down to remove the forward
         declaration of lookup_page_ext.

      2. Move page_ext_init_flatmem_late down into the existing non-SPARSEMEM
         block, which avoids adding a new non-SPARSEMEM block and keeps the
         non-SPARSEMEM code together.
      
      Link: https://lkml.kernel.org/r/20230714114749.1743032-4-shikemeng@huaweicloud.com
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_ext: remove rollback for untouched mem_section in online_page_ext · 3c09be5a
      Kemeng Shi authored
      
      
      If init_section_page_ext fails, we only need to roll back the
      mem_sections before the failed one.  Make the rollback end point the
      failed mem_section to avoid unnecessary rollback.

      pfn += PAGES_PER_SECTION is executed even if init_section_page_ext
      fails, so pfn points to the mem_section after the failed one.  Subtract
      one mem_section from pfn to get the failed mem_section.
      
      Link: https://lkml.kernel.org/r/20230714114749.1743032-3-shikemeng@huaweicloud.com
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
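
The rollback arithmetic described above can be modelled in user space as follows. This is a minimal sketch under stated assumptions: every sketch_* name and the tiny section size are hypothetical, and the functions only imitate the shape of the loop described in the commit message.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGES_PER_SECTION 4UL	/* hypothetical, tiny on purpose */
#define SKETCH_NR_SECTIONS 8

static bool sketch_inited[SKETCH_NR_SECTIONS];
static int sketch_free_calls;

/* Hypothetical init: fails for the section that starts at fail_pfn. */
static int sketch_init_section(unsigned long pfn, unsigned long fail_pfn)
{
	if (pfn == fail_pfn)
		return -1;
	sketch_inited[pfn / SKETCH_PAGES_PER_SECTION] = true;
	return 0;
}

static void sketch_free_section(unsigned long pfn)
{
	sketch_inited[pfn / SKETCH_PAGES_PER_SECTION] = false;
	sketch_free_calls++;
}

static int sketch_online(unsigned long start, unsigned long end,
			 unsigned long fail_pfn)
{
	unsigned long pfn;
	int fail = 0;

	assert(start % SKETCH_PAGES_PER_SECTION == 0);
	assert(end <= SKETCH_NR_SECTIONS * SKETCH_PAGES_PER_SECTION);

	for (pfn = start; !fail && pfn < end; pfn += SKETCH_PAGES_PER_SECTION)
		fail = sketch_init_section(pfn, fail_pfn);
	if (!fail)
		return 0;

	/*
	 * The loop increment ran once more after the failure, so pfn is one
	 * section past the failed one.  Step back so the rollback stops at
	 * the failed section and never touches untouched sections.
	 */
	end = pfn - SKETCH_PAGES_PER_SECTION;
	for (pfn = start; pfn < end; pfn += SKETCH_PAGES_PER_SECTION)
		sketch_free_section(pfn);
	return -1;
}
```

With eight sections and a failure in the fourth, only the three sections that were actually initialized are freed.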
    • mm/page_ext: remove unused return value of offline_page_ext · 063ff7cd
      Kemeng Shi authored
      
      
      Patch series "minor cleanups for page_ext".
      
      This series contains some random minor cleanups for page_ext.  More
      details can be found in respective patches.  
      
      
      This patch (of 3):
      
      offline_page_ext always returns 0 and no caller checks the return value.
      Remove the unused return value of offline_page_ext.
      
      Link: https://lkml.kernel.org/r/20230714114749.1743032-1-shikemeng@huaweicloud.com
      Link: https://lkml.kernel.org/r/20230714114749.1743032-2-shikemeng@huaweicloud.com
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • buffer: remove set_bh_page() · 5f6d2862
      Matthew Wilcox (Oracle) authored
      
      
      With all users converted to folio_set_bh(), remove this function.
      
      Link: https://lkml.kernel.org/r/20230713035512.4139457-8-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Tom Rix <trix@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • jbd2: use a folio in jbd2_journal_write_metadata_buffer() · 8147c4c4
      Matthew Wilcox (Oracle) authored
      
      
      The primary goal here is removing the use of set_bh_page().  Take the
      opportunity to switch from kmap_atomic() to kmap_local().  This simplifies
      the function as the offset is already added to the pointer.
      
      Link: https://lkml.kernel.org/r/20230713035512.4139457-7-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: Tom Rix <trix@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • ntfs3: convert ntfs_get_block_vbo() to use a folio · 07811230
      Matthew Wilcox (Oracle) authored
      
      
      Remove a user of set_bh_page().
      
      Link: https://lkml.kernel.org/r/20230713035512.4139457-6-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Tom Rix <trix@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • migrate: use folio_set_bh() instead of set_bh_page() · d5db4f9d
      Matthew Wilcox (Oracle) authored
      
      
      This function was converted before folio_set_bh() existed.  Catch up to
      the new API.
      
      Link: https://lkml.kernel.org/r/20230713035512.4139457-5-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Tom Rix <trix@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • affs: convert data read and write to use folios · 34113026
      Matthew Wilcox (Oracle) authored
      
      
      We still need to convert to/from folios in write_begin & write_end to fit
      the API, but this removes a lot of calls to old page-based functions,
      removing many hidden calls to compound_head().
      
      Link: https://lkml.kernel.org/r/20230713035512.4139457-4-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
      Acked-by: David Sterba <dsterba@suse.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jan Kara <jack@suse.com>
      Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Tom Rix <trix@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • affs: convert affs_symlink_read_folio() to use the folio · 41a638a1
      Matthew Wilcox (Oracle) authored
      
      
      Remove use of the old page APIs.  That includes use of setting PageError
      on error; simply not setting the uptodate flag is sufficient.
      
      Link: https://lkml.kernel.org/r/20230713035512.4139457-3-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: David Sterba <dsterba@suse.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Jan Kara <jack@suse.com>
      Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Tom Rix <trix@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • highmem: add memcpy_to_folio() and memcpy_from_folio() · b23d03ef
      Matthew Wilcox (Oracle) authored
      
      
      Patch series "More filesystem folio conversions for 6.6".
      
      Remove the only spots in affs which actually use a struct page; there
      are a few places where one is mentioned, but it's part of the interface.
      
      The rest of this is removing the remaining calls to set_bh_page(),
      and then removing the function before any new users show up.
      
      
      This patch (of 7):
      
      These are the folio equivalent of memcpy_to_page() and memcpy_from_page().
      
      [agruenba@redhat.com: use correct chunk size in memcpy()]
        Link: https://lkml.kernel.org/r/20230802144354.1023099-1-agruenba@redhat.com
      Link: https://lkml.kernel.org/r/20230713035512.4139457-1-willy@infradead.org
      Link: https://lkml.kernel.org/r/20230713035512.4139457-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Jan Kara <jack@suse.com>
      Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Tom Rix <trix@redhat.com>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
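
The chunk-size fix noted above concerns copies that span several pages. A user-space sketch of a page-by-page copy follows. It is an illustration only: struct sketch_folio, sketch_map() and the 16-byte page size are hypothetical, and the sketch always splits at page boundaries, which may differ from what the kernel helpers do for folios that are mapped contiguously.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 16u	/* hypothetical, tiny so tests cross pages */
#define SKETCH_NR_PAGES 4u

/* Hypothetical folio model: several pages, accessed one page at a time. */
struct sketch_folio {
	unsigned char pages[SKETCH_NR_PAGES][SKETCH_PAGE_SIZE];
};

/* Hypothetical "map": a pointer valid only up to the end of one page. */
static unsigned char *sketch_map(struct sketch_folio *f, size_t offset)
{
	return &f->pages[offset / SKETCH_PAGE_SIZE][offset % SKETCH_PAGE_SIZE];
}

static void sketch_copy_to_folio(struct sketch_folio *f, size_t offset,
				 const unsigned char *from, size_t len)
{
	assert(offset + len <= sizeof(f->pages));
	while (len > 0) {
		unsigned char *to = sketch_map(f, offset);
		/* Never copy past the end of the currently mapped page. */
		size_t room = SKETCH_PAGE_SIZE - offset % SKETCH_PAGE_SIZE;
		size_t chunk = len < room ? len : room;

		memcpy(to, from, chunk);
		from += chunk;
		offset += chunk;
		len -= chunk;
	}
}

static void sketch_copy_from_folio(unsigned char *to, struct sketch_folio *f,
				   size_t offset, size_t len)
{
	assert(offset + len <= sizeof(f->pages));
	while (len > 0) {
		const unsigned char *from = sketch_map(f, offset);
		size_t room = SKETCH_PAGE_SIZE - offset % SKETCH_PAGE_SIZE;
		size_t chunk = len < room ? len : room;

		memcpy(to, from, chunk);
		to += chunk;
		offset += chunk;
		len -= chunk;
	}
}
```

The chunk is the smaller of the remaining length and the room left in the current page, so a copy that starts mid-page stops at the page boundary and continues from the next mapping.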
    • mm/page_table_check: remove unused parameter in [__]page_table_check_pud_set · 6d144436
      Kemeng Shi authored
      
      
      Remove unused addr in __page_table_check_pud_set and
      page_table_check_pud_set.
      
      Link: https://lkml.kernel.org/r/20230713172636.1705415-9-shikemeng@huaweicloud.com
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_table_check: remove unused parameter in [__]page_table_check_pmd_set · a3b83713
      Kemeng Shi authored
      
      
      Remove unused addr in __page_table_check_pmd_set and
      page_table_check_pmd_set.
      
      Link: https://lkml.kernel.org/r/20230713172636.1705415-8-shikemeng@huaweicloud.com
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_table_check: remove unused parameter in [__]page_table_check_pte_set · 1066293d
      Kemeng Shi authored
      
      
      Remove unused addr in __page_table_check_pte_set and
      page_table_check_pte_set.
      
      Link: https://lkml.kernel.org/r/20230713172636.1705415-7-shikemeng@huaweicloud.com
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_table_check: remove unused parameter in [__]page_table_check_pud_clear · 931c38e1
      Kemeng Shi authored
      
      
      Remove unused addr in __page_table_check_pud_clear and
      page_table_check_pud_clear.
      
      Link: https://lkml.kernel.org/r/20230713172636.1705415-6-shikemeng@huaweicloud.com
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_table_check: remove unused parameter in [__]page_table_check_pmd_clear · 1831414c
      Kemeng Shi authored
      
      
      Remove unused addr in page_table_check_pmd_clear and
      __page_table_check_pmd_clear.
      
      Link: https://lkml.kernel.org/r/20230713172636.1705415-5-shikemeng@huaweicloud.com
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_table_check: remove unused parameter in [__]page_table_check_pte_clear · aa232204
      Kemeng Shi authored
      
      
      Remove unused addr in page_table_check_pte_clear and
      __page_table_check_pte_clear.
      
      Link: https://lkml.kernel.org/r/20230713172636.1705415-4-shikemeng@huaweicloud.com
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_table_check: remove unused parameters in page_table_check_set() · 2f933eaf
      Kemeng Shi authored
      
      
      Remove unused mm and addr in page_table_check_set().
      
      Link: https://lkml.kernel.org/r/20230713172636.1705415-3-shikemeng@huaweicloud.com
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/page_table_check: remove unused parameters in page_table_check_clear() · 34c876ce
      Kemeng Shi authored
      
      
      Patch series "Remove unused parameters in page_table_check".
      
      This series removes unused parameters in functions from page_table_check.
      The first 2 patches remove the unused mm and addr parameters in the
      static common functions page_table_check_clear and page_table_check_set.
      The last 6 patches remove the unused addr parameter in some external
      functions which only needed addr for the now cleaned-up
      page_table_check_clear or page_table_check_set.
      There is no intended functional change.
      
      
      This patch (of 8):
      
      Remove unused mm and addr in function page_table_check_clear().
      
      Link: https://lkml.kernel.org/r/20230713172636.1705415-1-shikemeng@huaweicloud.com
      Link: https://lkml.kernel.org/r/20230713172636.1705415-2-shikemeng@huaweicloud.com
      Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memcg: fix obsolete comment above MEM_CGROUP_MAX_RECLAIM_LOOPS · f4d005af
      Miaohe Lin authored
      Since commit 5660048c ("mm: move memcg hierarchy reclaim to generic
      reclaim code"), mem_cgroup_hierarchical_reclaim() has been renamed to
      mem_cgroup_soft_reclaim().  Update the corresponding comment.
      
      Link: https://lkml.kernel.org/r/20230713121432.273381-1-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <muchun.song@linux.dev>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/huge_memory: use RMAP_NONE when calling page_add_anon_rmap() · 5ba72b4d
      Miaohe Lin authored
      
      
      It's more convenient and readable to use RMAP_NONE instead of false when
      calling page_add_anon_rmap().  No functional change intended.
      
      Link: https://lkml.kernel.org/r/20230713120557.218592-1-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: add tests for HWPOISON hugetlbfs read · ba91e7e5
      Jiaqi Yan authored
      
      
      Add tests for the improvement made to the read operation on a HWPOISON
      hugetlb page with different read granularities.  For each chunk size,
      three read scenarios are tested:
      1. Simple regression test on read without HWPOISON.
      2. Sequential read page by page should succeed until it encounters the
         1st raw HWPOISON subpage.
      3. After skipping a raw HWPOISON subpage by lseek, read()s always succeed.
      
      Link: https://lkml.kernel.org/r/20230713001833.3778937-5-jiaqiyan@google.com
      Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlbfs: improve read HWPOISON hugepage · 38c1ddbd
      Jiaqi Yan authored
      
      
      When a hugepage contains HWPOISON pages, read() fails to read any byte of
      the hugepage and returns -EIO, although many bytes in the HWPOISON
      hugepage are readable.
      
      Improve this by allowing hugetlbfs_read_iter to return as many bytes as
      possible.  For a requested range [offset, offset + len) that contains a
      HWPOISON page, return [offset, first HWPOISON page addr); the next read
      attempt will fail and return -EIO.
      
      Link: https://lkml.kernel.org/r/20230713001833.3778937-4-jiaqiyan@google.com
      Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
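
The read semantics described in this commit message can be modelled with a small user-space function. This is a sketch under assumptions, not hugetlbfs_read_iter itself: sketch_readable_bytes(), the subpage size and the subpage count are hypothetical, and the model reports only how many bytes a read would return.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define SKETCH_SUBPAGE_SIZE 4096L	/* hypothetical raw page size */
#define SKETCH_NR_SUBPAGES 8		/* hypothetical hugepage size in raw pages */

/*
 * Return the number of bytes a read of [offset, offset + len) would return
 * in this model: everything up to the first poisoned subpage, or -EIO when
 * the very first requested byte is already in a poisoned subpage.
 */
static long sketch_readable_bytes(const bool poisoned[SKETCH_NR_SUBPAGES],
				  long offset, long len)
{
	long huge_size = SKETCH_NR_SUBPAGES * SKETCH_SUBPAGE_SIZE;
	long end, pos = offset;

	assert(offset >= 0 && len >= 0);
	end = offset + len;
	if (end > huge_size)
		end = huge_size;

	while (pos < end) {
		long idx = pos / SKETCH_SUBPAGE_SIZE;
		long page_end = (idx + 1) * SKETCH_SUBPAGE_SIZE;

		if (poisoned[idx])
			break;
		pos = page_end < end ? page_end : end;
	}

	if (pos == offset && pos < end)
		return -EIO;
	return pos - offset;
}
```

With the fourth subpage poisoned, a read of the whole hugepage from offset 0 returns the first three subpages, a read starting at the poisoned subpage returns -EIO, and a read starting just after it succeeds.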
    • mm/hwpoison: check if a raw page in a hugetlb folio is raw HWPOISON · b79f8eb4
      Jiaqi Yan authored
      
      
      Add the functionality, is_raw_hwpoison_page_in_hugepage, to tell if a raw
      page in a hugetlb folio is HWPOISON.  This functionality relies on
      RawHwpUnreliable not being set; otherwise the hugepage's raw HWPOISON
      list becomes meaningless.

      is_raw_hwpoison_page_in_hugepage holds mf_mutex in order to synchronize
      with folio_set_hugetlb_hwpoison and folio_free_raw_hwp, which iterate,
      insert, or delete entries in raw_hwp_list.  llist itself doesn't ensure
      insertion and removal are synchronized with the llist_for_each_entry used
      by is_raw_hwpoison_page_in_hugepage (unless iterated entries are already
      deleted from the list).  The caller can minimize the overhead of lock
      cycles by first checking the HWPOISON flag of the folio.

      Export this functionality to be immediately used in the read operation
      for hugetlbfs.
      
      Link: https://lkml.kernel.org/r/20230713001833.3778937-3-jiaqiyan@google.com
      Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hwpoison: delete all entries before traversal in __folio_free_raw_hwp · 9e130c4b
      Jiaqi Yan authored
      Patch series "Improve hugetlbfs read on HWPOISON hugepages", v4.
      
      Today when hardware memory is corrupted in a hugetlb hugepage, the kernel
      leaves the hugepage in the pagecache [1]; otherwise a future mmap or read
      would be subject to silent data corruption.  This is implemented by
      returning -EIO from hugetlbfs_read_iter immediately if the hugepage has
      the HWPOISON flag set.
      
      Since memory_failure already tracks the raw HWPOISON subpages in a
      hugepage, a natural improvement is possible: if userspace only asks for
      healthy subpages in the pagecache, kernel can return these data.
      
      This patchset implements this improvement.  It consists of three parts.
      The 1st commit exports the functionality to tell if a subpage inside a
      hugetlb hugepage is a raw HWPOISON page.  The 2nd commit teaches
      hugetlbfs_read_iter to return as many healthy bytes as possible.  The 3rd
      commit properly tests this new feature.

      [1] commit 8625147c ("hugetlbfs: don't delete error page from pagecache")
      
      
      This patch (of 4):
      
      Traversal on llist (e.g.  llist_for_each_safe) is only safe AFTER entries
      are deleted from the llist.  Correct the way __folio_free_raw_hwp deletes
      and frees raw_hwp_page entries in raw_hwp_list: first llist_del_all, then
      kfree within llist_for_each_safe.
      
      As of today, concurrent adding, deleting, and traversal on raw_hwp_list
      from hugetlb.c and/or memory-failure.c are fine with each other.  Note
      this is guaranteed partly by the lock-free nature of llist, and partly by
      holding hugetlb_lock and/or mf_mutex.  For example, as llist_del_all is
      lock-free with itself, folio_clear_hugetlb_hwpoison()s from
      __update_and_free_hugetlb_folio and memory_failure won't need explicit
      locking when freeing the raw_hwp_list.  New code that manipulates
      raw_hwp_list must be careful to ensure the concurrency correctness.
      
      Link: https://lkml.kernel.org/r/20230713001833.3778937-1-jiaqiyan@google.com
      Link: https://lkml.kernel.org/r/20230713001833.3778937-2-jiaqiyan@google.com
      Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
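
The "detach everything first, then walk and free" pattern described above can be sketched in user space with C11 atomics. This is not the kernel llist API: struct sketch_node, sketch_push(), sketch_del_all() and sketch_free_all() are hypothetical names that only imitate the idea of a lock-free singly linked list.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct sketch_node {
	struct sketch_node *next;
	int value;
};

struct sketch_head {
	_Atomic(struct sketch_node *) first;
};

/* Lock-free push at the head of the list. */
static void sketch_push(struct sketch_head *h, struct sketch_node *n)
{
	struct sketch_node *old = atomic_load(&h->first);

	assert(n != NULL);
	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&h->first, &old, n));
}

/* Atomically detach the whole chain; the list is empty afterwards. */
static struct sketch_node *sketch_del_all(struct sketch_head *h)
{
	return atomic_exchange(&h->first, (struct sketch_node *)NULL);
}

/*
 * Free every entry.  The chain is detached first, so no other thread can
 * reach these nodes through the head while they are walked and freed.
 */
static int sketch_free_all(struct sketch_head *h)
{
	struct sketch_node *n = sketch_del_all(h);
	struct sketch_node *next;
	int freed = 0;

	for (; n != NULL; n = next) {
		next = n->next;	/* read the link before freeing the node */
		free(n);
		freed++;
	}
	return freed;
}
```

After the exchange the detached chain is private to the caller, which is why the walk needs no further synchronization in this model; concurrent pushes simply start a new chain on the now empty head.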
    • mm/mmap: move vma operations to mm_struct out of the critical section of file mapping lock · 6852c46c
      Yu Ma authored
      UnixBench/Execl represents a class of workload where bash scripts are
      spawned frequently to do some short jobs.  When running multiple parallel
      tasks, hot osq_lock is observed from do_mmap and exit_mmap.  Both of them
      come from load_elf_binary through the call chain
      "execl->do_execveat_common->bprm_execve->load_elf_binary".
      
      do_mmap calls mmap_region to create a vma node, initialize it, and
      insert it into the vma maintenance structure in mm_struct and the
      i_mmap tree of the mapped file, then increases map_count to record the
      number of vma nodes used.  The hot osq_lock protects operations on the
      file's i_mmap tree.  mm_struct member changes such as vma insertion and
      the map_count update do not affect the i_mmap tree.  Move those
      operations out of the lock's critical section to reduce the hold time
      of the lock.
      
      With this change, on an Intel Sapphire Rapids 112C/224T platform, based
      on v6.0-rc6, the 160 parallel score improves by 12%.  The patch has no
      obvious performance gain on v6.5-rc1 due to a regression of this
      benchmark from commit f1a79412 ("mm: convert mm's rss stats into
      percpu_counter").  The related discussion and conclusion can be found in
      the mail thread initiated by 0day:
      https://lore.kernel.org/linux-mm/a4aa2e13-7187-600b-c628-7e8fb108def0@intel.com/
      
      Link: https://lkml.kernel.org/r/20230712145739.604215-1-yu.ma@intel.com
      Signed-off-by: Yu Ma <yu.ma@intel.com>
      Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A . Shutemov <kirill@shutemov.name>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Zhu, Lipeng <lipeng.zhu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: remove clear_page_idle() · 73e791d7
      Xueshi Hu authored
      
      
      All callers have now been converted to call folio_clear_idle().
      
      Link: https://lkml.kernel.org/r/20230712134959.145373-1-xueshi.hu@smartx.com
      Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Charan Teja Kalla <quic_charante@quicinc.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/pgtable: notes on pte_offset_map[_lock]() · 610d0657
      Hugh Dickins authored
      
      
      Add a block of comments on pte_offset_map_lock(), pte_offset_map() and
      pte_offset_map_nolock() to mm/pgtable-generic.c, to help explain them.
      
      Link: https://lkml.kernel.org/r/b791c3b0-25c6-a263-d785-d564344eb644@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: delete mmap_write_trylock() and vma_try_start_write() · cf95e337
      Hugh Dickins authored
      
      
      mmap_write_trylock() and vma_try_start_write() were added just for
      khugepaged, but now it has no use for them: delete.
      
      Link: https://lkml.kernel.org/r/4e6db3d-e8e-73fb-1f2a-8de2dab2a87c@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: delete khugepaged_collapse_pte_mapped_thps() · d50791c2
      Hugh Dickins authored
      
      
      Now that retract_page_tables() can retract page tables reliably, without
      depending on trylocks, delete all the apparatus for khugepaged to try
      again later: khugepaged_collapse_pte_mapped_thps() etc; and free up the
      per-mm memory which was set aside for that in the khugepaged_mm_slot.
      
      But one part of that is worth keeping: when hpage_collapse_scan_file()
      found SCAN_PTE_MAPPED_HUGEPAGE, that address was noted in the mm_slot to
      be tried for retraction later - catching, for example, page tables where a
      reversible mprotect() of a portion had required splitting the pmd, but now
      it can be recollapsed.  Call collapse_pte_mapped_thp() directly in this
      case (why was it deferred before?  I assume an issue with needing
      mmap_lock for write, but now it's only needed for read).
      
      [hughd@google.com: fix mmap_locked handling]
        Link: https://lkml.kernel.org/r/bfc6cab2-497f-32bf-dd5-98dc1987e4a9@google.com
      Link: https://lkml.kernel.org/r/a5dce57-6dfa-5559-4698-e817eb2f993@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock() · 1043173e
      Hugh Dickins authored
      
      
      Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().  It
      does need mmap_read_lock(), but it does not need mmap_write_lock(), nor
      vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing paths are
      relying on pte_offset_map_lock() and pmd_lock(), so use those.
      
      Follow the pattern in retract_page_tables(); and using pte_free_defer()
      removes most of the need for tlb_remove_table_sync_one() here; but call
      pmdp_get_lockless_sync() to use it in the PAE case.
      
      First check the VMA, in case page tables are being torn down: from JannH. 
      Confirm the preliminary find_pmd_or_thp_or_none() once page lock has been
      acquired and the page looks suitable: from then on its state is stable.
      
      However, collapse_pte_mapped_thp() was doing something others don't:
      freeing a page table still containing "valid" entries.  i_mmap lock did
      stop a racing truncate from double-freeing those pages, but we prefer
      collapse_pte_mapped_thp() to clear the entries as usual.  Their TLB flush
      can wait until the pmdp_collapse_flush() which follows, but the
      mmu_notifier_invalidate_range_start() has to be done earlier.
      
      Do the "step 1" checking loop without mmu_notifier: it wouldn't be good
      for khugepaged to keep on repeatedly invalidating a range which is then
      found unsuitable, e.g. because it contains COWs.  "step 2", which does the clearing,
      must then be more careful (after dropping ptl to do mmu_notifier), with
      abort prepared to correct the accounting like "step 3".  But with those
      entries now cleared, "step 4" (after dropping ptl to do pmd_lock) is kept
      safe by the huge page lock, which stops new PTEs from being faulted in.
      
      [hughd@google.com: don't set mmap_locked = true in madvise_collapse()]
        Link: https://lkml.kernel.org/r/d3d9ff14-ef8-8f84-e160-bfa1f5794275@google.com
      [hughd@google.com: use ptep_clear() instead of pte_clear()]
        Link: https://lkml.kernel.org/r/e0197433-8a47-6a65-534d-eda26eeb78b0@google.com
      Link: https://lkml.kernel.org/r/b53be6a4-7715-51f9-aad-f1347dcb7c4@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: retract_page_tables() without mmap or vma lock · 1d65b771
      Hugh Dickins authored
      
      
      Simplify shmem and file THP collapse's retract_page_tables(), and relax
      its locking: to improve its success rate and to lessen impact on others.
      
      Instead of its MADV_COLLAPSE case doing set_huge_pmd() at target_addr of
      target_mm, leave that part of the work to madvise_collapse() calling
      collapse_pte_mapped_thp() afterwards: just adjust collapse_file()'s result
      code to arrange for that.  That spares retract_page_tables() four
      arguments; and since it will be successful in retracting all of the page
      tables expected of it, no need to track and return a result code itself.
      
      It needs i_mmap_lock_read(mapping) for traversing the vma interval tree,
      but it does not need i_mmap_lock_write() for that: page_vma_mapped_walk()
      allows for pte_offset_map_lock() etc to fail, and uses pmd_lock() for
      THPs.  retract_page_tables() just needs to use those same spinlocks to
      exclude it briefly, while transitioning pmd from page table to none: so
      restore its use of pmd_lock() inside of which pte lock is nested.
      
      Users of pte_offset_map_lock() etc all now allow for them to fail: so
      retract_page_tables() now has no use for mmap_write_trylock() or
      vma_try_start_write().  In common with rmap and page_vma_mapped_walk(), it
      does not even need the mmap_read_lock().
      
      But those users do expect the page table to remain a good page table,
      until they unlock and rcu_read_unlock(): so the page table cannot be freed
      immediately, but rather by the recently added pte_free_defer().
      
      Use the (usually a no-op) pmdp_get_lockless_sync() to send an interrupt
      when PAE, and pmdp_collapse_flush() did not already do so: to make sure
      that the start,pmdp_get_lockless(),end sequence in __pte_offset_map()
      cannot pick up a pmd entry with mismatched pmd_low and pmd_high.
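
The mismatch being guarded against can be pictured with a small userspace model: a 64-bit entry stored as two 32-bit halves, read by a lockless reader that re-checks the low half after loading the high half. This is only a generic sketch of the torn-read problem, not the kernel's pmdp_get_lockless(); `struct split_entry`, `init_entry()`, `write_entry()` and `read_entry()` are hypothetical names, and treating a zero low half as "not present" is an assumption of the model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit entry stored as two 32-bit halves. */
struct split_entry {
    _Atomic uint32_t low;   /* low == 0 models a not-present entry */
    _Atomic uint32_t high;
};

static void init_entry(struct split_entry *e)
{
    atomic_init(&e->low, 0);
    atomic_init(&e->high, 0);
}

/* Writer: mark the entry not-present before changing the high half. */
static void write_entry(struct split_entry *e, uint64_t val)
{
    atomic_store(&e->low, 0);
    atomic_store(&e->high, (uint32_t)(val >> 32));
    atomic_store(&e->low, (uint32_t)val);
}

/* Lockless reader: retry if the low half changed while high was loaded. */
static uint64_t read_entry(struct split_entry *e)
{
    uint32_t low, high;

    do {
        low = atomic_load(&e->low);
        if (low == 0)
            return 0;   /* not present: the high half is meaningless */
        high = atomic_load(&e->high);
        /* Re-check low: if it changed, high may belong to another value. */
    } while (low != atomic_load(&e->low));

    return ((uint64_t)high << 32) | low;
}
```

Re-checking the low half cannot notice a low half that was cleared and then restored to the same value between the two loads; a window of that kind is what additional synchronization has to close.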
      
      retract_page_tables() can be enhanced to replace_page_tables(), which
      inserts the final huge pmd without mmap lock: going through an invalid
      state instead of pmd_none() followed by fault.  But that enhancement does
      raise some more questions: leave it until a later release.
      
      Link: https://lkml.kernel.org/r/f88970d9-d347-9762-ae6d-da978e8a4df@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/pgtable: add pte_free_defer() for pgtable as page · 13cf577e
      Hugh Dickins authored
      
      
      Add the generic pte_free_defer(), to call pte_free() via call_rcu(). 
      pte_free_defer() will be called inside khugepaged's retract_page_tables()
      loop, where allocating extra memory cannot be relied upon.  This version
      suits all those architectures which use an unfragmented page for one page
      table (none of whose pte_free()s use the mm arg which was passed to it).
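
As a rough userspace illustration of why an embedded callback head matters when allocation cannot be relied upon: the sketch below queues an object for deferred freeing using a node stored inside the object itself, so queuing never allocates and cannot fail. All names (`struct defer_head`, `struct table`, `defer_free()`, `run_deferred()`) are hypothetical, and the explicit `run_deferred()` call merely stands in for an RCU grace period elapsing; this is not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct rcu_head. */
struct defer_head {
    struct defer_head *next;
    void (*func)(struct defer_head *head);
};

/* Hypothetical stand-in for a page used as exactly one page table. */
struct table {
    struct defer_head head;   /* embedded: queuing needs no allocation */
    unsigned long entries[8];
};

static struct defer_head *pending;  /* singly linked list of queued frees */
static int freed_count;             /* bookkeeping for the example only */

static void table_free_now(struct defer_head *head)
{
    /* head is the first member, so the cast recovers the container. */
    free((struct table *)head);
    freed_count++;
}

/* Queue the table to be freed later; allocates nothing, cannot fail. */
static void defer_free(struct table *t)
{
    t->head.func = table_free_now;
    t->head.next = pending;
    pending = &t->head;
}

/* Stand-in for "a grace period has elapsed": run every queued callback. */
static void run_deferred(void)
{
    while (pending) {
        struct defer_head *head = pending;

        pending = head->next;
        head->func(head);
    }
}
```

Between `defer_free()` and `run_deferred()` the table's memory stays valid, which models readers that may still be walking the table when it is retracted.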
      
      Link: https://lkml.kernel.org/r/78e921b0-b681-a1b0-dc20-44c9efa4ef3c@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • s390: add pte_free_defer() for pgtables sharing page · 8211dad6
      Hugh Dickins authored
      
      
      Add s390-specific pte_free_defer(), to free table page via call_rcu(). 
      pte_free_defer() will be called inside khugepaged's retract_page_tables()
      loop, where allocating extra memory cannot be relied upon.  This precedes
      the generic version to avoid build breakage from incompatible pgtable_t.
      
      This version is more complicated than others: because s390 fits two 2K
      page tables into one 4K page (so page->rcu_head must be shared between
      both halves), and already uses page->lru (which page->rcu_head overlays)
      to list any free halves; with clever management by page->_refcount bits.
      
      Build upon the existing management, adjusted to follow a new rule: that a
      page is never on the free list if pte_free_defer() was used on either half
      (marked by PageActive).  And for simplicity, delay calling RCU until both
      halves are freed.
      
      Not adding back unallocated fragments to the list in pte_free_defer() can
      result in wasting some amount of memory for pagetables, depending on how
      long the allocated fragment will stay in use.  In practice, this effect is
      expected to be insignificant, and not justify a far more complex approach,
      which might allow adding the fragments back later in __tlb_remove_table(),
      where we might not have a stable mm any more.
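
The rule described above can be modelled in a few lines of userspace C. This is a simplified sketch of the stated policy only, not the s390 code (which tracks this state in page->_refcount bits): `struct shared_page` and its boolean fields are hypothetical, `active` stands in for PageActive, and `rcu_queued` stands in for handing the whole page to call_rcu().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one 4K page holding two 2K page tables. */
struct shared_page {
    unsigned int used;   /* bit 0 and bit 1: that half is allocated */
    bool active;         /* stands in for PageActive: a half was defer-freed */
    bool on_free_list;   /* page is listed as having a free half */
    bool rcu_queued;     /* whole page handed to the deferred-free path */
    bool freed;          /* whole page freed immediately */
};

/* Start with both halves allocated, as the example's initial state. */
static void page_init_full(struct shared_page *p)
{
    p->used = 0x3;
    p->active = false;
    p->on_free_list = false;
    p->rcu_queued = false;
    p->freed = false;
}

/* Shared tail: decide what happens once a half has been released. */
static void release_half(struct shared_page *p, unsigned int half)
{
    p->used &= ~(1u << half);
    if (p->used == 0) {
        /* Both halves gone: the page leaves the list and is freed. */
        p->on_free_list = false;
        if (p->active)
            p->rcu_queued = true;   /* one deferred free for the whole page */
        else
            p->freed = true;
    } else if (!p->active) {
        /* Only a page never defer-freed may offer its free half for reuse. */
        p->on_free_list = true;
    }
}

/* Ordinary free of one half. */
static void free_half(struct shared_page *p, unsigned int half)
{
    release_half(p, half);
}

/* Deferred free of one half: from now on the page is never listed. */
static void free_half_defer(struct shared_page *p, unsigned int half)
{
    p->active = true;
    p->on_free_list = false;
    release_half(p, half);
}
```

In this model a page whose first half was defer-freed keeps its other half out of circulation until that half is freed too, which is the memory-wasting effect the message accepts as insignificant.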
      
      [hughd@google.com: Claudio finds warning on mm_has_pgste() more useful than on mm_alloc_pgste()]
        Link: https://lkml.kernel.org/r/3bc095ba-a180-ce3b-82b1-2bfc64612f3@google.com
      Link: https://lkml.kernel.org/r/94eccf5f-264c-8abe-4567-e77f4b4e14a@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Tested-by: Alexander Gordeev <agordeev@linux.ibm.com>
      Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • sparc: add pte_free_defer() for pte_t *pgtable_t · ad1ac8d9
      Hugh Dickins authored
      
      
      Add sparc-specific pte_free_defer(), to call pte_free() via call_rcu(). 
      pte_free_defer() will be called inside khugepaged's retract_page_tables()
      loop, where allocating extra memory cannot be relied upon.  This precedes
      the generic version to avoid build breakage from incompatible pgtable_t.
      
      sparc32 supports pagetables sharing a page, but does not support THP;
      sparc64 supports THP, but does not support pagetables sharing a page.  So
      the sparc-specific pte_free_defer() is as simple as the generic one,
      except for converting between pte_t *pgtable_t and struct page *.
      
      Link: https://lkml.kernel.org/r/dc4f318d-a66a-5622-dc44-9018ea814b37@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • powerpc: add pte_free_defer() for pgtables sharing page · 32cc0b7c
      Hugh Dickins authored
      
      
      Add powerpc-specific pte_free_defer(), to free table page via call_rcu(). 
      pte_free_defer() will be called inside khugepaged's retract_page_tables()
      loop, where allocating extra memory cannot be relied upon.  This precedes
      the generic version to avoid build breakage from incompatible pgtable_t.
      
      This is awkward because the struct page contains only one rcu_head, but
      that page may be shared between PTE_FRAG_NR pagetables, each wanting to
      use the rcu_head at the same time.  But powerpc never reuses a fragment
      once it has been freed: so mark the page Active in pte_free_defer(),
      before calling pte_fragment_free() directly; and there call_rcu() to
      pte_free_now() when last fragment is freed and the page is PageActive.
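
A minimal userspace model of that scheme, assuming a per-page count of live fragments: because a freed fragment is never reused, the single callback head is only needed once, when the last fragment goes away. `struct frag_page`, `FRAG_NR` and the function names are hypothetical stand-ins (for the page, PTE_FRAG_NR, pte_fragment_free() and pte_free_defer()), and `rcu_queued` merely records that call_rcu() would have been invoked.

```c
#include <assert.h>
#include <stdbool.h>

#define FRAG_NR 4   /* hypothetical count, standing in for PTE_FRAG_NR */

/* Hypothetical model of one page shared by FRAG_NR page tables. */
struct frag_page {
    int refcount;     /* fragments still in use */
    bool active;      /* stands in for PageActive */
    bool rcu_queued;  /* page handed to the deferred-free path */
    bool freed;       /* page freed immediately */
};

/* Start with every fragment handed out, as the example's initial state. */
static void frag_page_init(struct frag_page *p)
{
    p->refcount = FRAG_NR;
    p->active = false;
    p->rcu_queued = false;
    p->freed = false;
}

/* Free one fragment; only the last one decides how the page is freed. */
static void fragment_free(struct frag_page *p)
{
    assert(p->refcount > 0);
    if (--p->refcount == 0) {
        if (p->active)
            p->rcu_queued = true;   /* single use of the one callback head */
        else
            p->freed = true;
    }
}

/* Deferred free: mark the page, then free the fragment as usual. */
static void fragment_free_defer(struct frag_page *p)
{
    p->active = true;
    fragment_free(p);
}
```

Marking the page rather than the fragment means it does not matter which fragment was defer-freed or in what order the others are released.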
      
      Link: https://lkml.kernel.org/r/6e3ca5f1-334d-4b14-b92d-fc8e99914fcb@google.com
      Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • powerpc: assert_pte_locked() use pte_offset_map_nolock() · 3d140215
      Hugh Dickins authored
      
      
      Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
      in assert_pte_locked().  BUG if pte_offset_map_nolock() fails.
      
      This mod might cause new crashes: which either expose my ignorance, or
      indicate issues to be fixed, or limit the usage of assert_pte_locked().
      
      [hughd@google.com: assert_pte_locked() still needs the pmd_none() check]
        Link: https://lkml.kernel.org/r/c73d1543-532c-3da2-8cf2-a95363a14116@google.com
      Link: https://lkml.kernel.org/r/e8d56c95-c132-a82e-5f5f-7bb1b738b057@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm: adjust_pte() use pte_offset_map_nolock() · de2e4626
      Hugh Dickins authored
      
      
      Instead of pte_lockptr(), use the recently added pte_offset_map_nolock()
      in adjust_pte(): because it gives the not-locked ptl for precisely that
      pte, which the caller can then safely lock; whereas pte_lockptr() is not
      so tightly coupled, because it dereferences the pmd pointer again.
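      
      The coupling argument can be illustrated with a small userspace model.
      This is only a sketch, not kernel code: struct table, map_nolock() and
      lockptr() are hypothetical stand-ins for a page table with its split
      lock, pte_offset_map_nolock() and pte_lockptr(), and plain ints stand in
      for ptes and spinlocks.  The first helper derives both pointers from one
      read of the slot; the second reads the slot again and so may see a
      different table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: one "page table" carrying its own lock. */
struct table {
    int lock;      /* stands in for the split page table lock */
    int pte[4];    /* stands in for the pte entries */
};

/*
 * Models pte_offset_map_nolock(): the slot (standing in for *pmd) is read
 * exactly once, so the returned pte and *ptlp belong to the same table.
 * The lock is returned not-locked; the caller decides when to take it.
 */
static int *map_nolock(struct table **slot, size_t idx, int **ptlp)
{
    struct table *t = *slot;    /* single snapshot of the slot */

    if (!t)
        return NULL;
    *ptlp = &t->lock;
    return &t->pte[idx];
}

/*
 * Models pte_lockptr(): the slot is dereferenced again, so the result
 * reflects whatever the slot points to now, not what it pointed to when
 * the pte was looked up.
 */
static int *lockptr(struct table **slot)
{
    return &(*slot)->lock;
}
```

      In this model, if the slot is switched to another table between the
      lookup and the lockptr() call, lockptr() returns the lock of the new
      table, while the pointer filled in by map_nolock() still matches the
      pte that was actually returned.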
      
      Link: https://lkml.kernel.org/r/4d5258bd-ffa0-018-253a-25f2c9b783f7@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      de2e4626
    • mm/pgtable: add PAE safety to __pte_offset_map() · 146b42e0
      Hugh Dickins authored
      
      
      There is a faint risk that __pte_offset_map(), on a 32-bit architecture
      with a 64-bit pmd_t e.g.  x86-32 with CONFIG_X86_PAE=y, would succeed on a
      pmdval assembled from a pmd_low and a pmd_high which never belonged
      together: their combination not pointing to a page table at all, perhaps
      not even a valid pfn.  pmdp_get_lockless() is not enough to prevent that.
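      
      The mismatched-halves hazard can be replayed deterministically in
      userspace.  This is a sketch under stated assumptions, not the kernel
      implementation: struct pmd_halves, pmd_combine() and torn_read() are
      hypothetical names, and the "concurrent" update is simply performed
      between the two 32-bit loads to force the bad interleaving.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 64-bit entry stored as two 32-bit words. */
struct pmd_halves {
    uint32_t low;
    uint32_t high;
};

/* Assemble the 64-bit value from its two halves. */
static uint64_t pmd_combine(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | low;
}

/*
 * Replay the bad interleaving: the reader loads the low half, the entry is
 * then replaced as a whole, and only afterwards the reader loads the high
 * half.  The result mixes the old low half with the new high half.
 */
static uint64_t torn_read(struct pmd_halves *p, struct pmd_halves newval)
{
    uint32_t low = p->low;      /* reader: first 32-bit load */
    uint32_t high;

    *p = newval;                /* writer: entry replaced in between */
    high = p->high;             /* reader: second 32-bit load */
    return pmd_combine(low, high);
}
```

      With an old entry of low 0x00001067, high 0 and a new entry of low
      0x000020e7, high 1, torn_read() yields 0x100001067, which equals
      neither the old nor the new 64-bit value.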
      
      Guard against that (on such configs) by local_irq_save() blocking TLB
      flush between present updates, as linux/pgtable.h suggests.  It's only
      needed around the pmdp_get_lockless() in __pte_offset_map(): a race when
      __pte_offset_map_lock() repeats the pmdp_get_lockless() after getting the
      lock, would just send it back to __pte_offset_map() again.
      
      Complement this pmdp_get_lockless_start() and pmdp_get_lockless_end(),
      used only locally in __pte_offset_map(), with a pmdp_get_lockless_sync()
      synonym for tlb_remove_table_sync_one(): to send the necessary interrupt
      at the right moment on those configs which do not already send it.
      
      CONFIG_GUP_GET_PXX_LOW_HIGH is enabled when required by mips, sh and x86. 
      It is not enabled by arm-32 CONFIG_ARM_LPAE: my understanding is that Will
      Deacon's 2020 enhancements to READ_ONCE() are sufficient for arm.  It is
      not enabled by arc, but its pmd_t is 32-bit even when pte_t 64-bit.
      
      Limit the IRQ disablement to CONFIG_HIGHPTE?  Perhaps, but would need a
      little more work, to retry if pmd_low good for page table, but pmd_high
      non-zero from THP (and that might be making x86-specific assumptions).
      
      Link: https://lkml.kernel.org/r/3adcd8f-9191-2df1-d7ea-c4877698aad@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      146b42e0
    • mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s · a349d72f
      Hugh Dickins authored
      
      
      Patch series "mm: free retracted page table by RCU", v3.
      
      Some mmap_lock avoidance i.e.  latency reduction.  Initially just for the
      case of collapsing shmem or file pages to THPs: the usefulness of
      MADV_COLLAPSE on shmem is being limited by that mmap_write_lock it
      currently requires.
      
      Likely to be relied upon later in other contexts e.g.  freeing of empty
      page tables (but that's not work I'm doing).  mmap_write_lock avoidance
      when collapsing to anon THPs?  Perhaps, but again that's not work I've
      done: a quick attempt was not as easy as the shmem/file case.
      
      These changes (though of course not these exact patches) have been in
      Google's data centre kernel for three years now: we do rely upon them.
      
      
      This patch (of 13):
      
      Before putting them to use (several commits later), add rcu_read_lock() to
      pte_offset_map(), and rcu_read_unlock() to pte_unmap().  Make this a
      separate commit, since it risks exposing imbalances: prior commits have
      fixed all the known imbalances, but we may find some have been missed.
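      
      What an "imbalance" means here can be shown with a counter-based
      userspace model.  This is a sketch only: rcu_depth and the model_*
      helpers are hypothetical stand-ins for the RCU read-side nesting and
      for pte_offset_map() and pte_unmap(), and the sketch assumes that a
      failed map leaves no read-side section open.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the RCU read-side nesting depth; purely illustrative. */
static int rcu_depth;

/* Models pte_offset_map(): on success the read-side section stays open. */
static int *model_pte_offset_map(int *table, size_t idx)
{
    rcu_depth++;                /* models rcu_read_lock() */
    if (!table) {
        rcu_depth--;            /* assumed: failure path unlocks itself */
        return NULL;
    }
    return &table[idx];
}

/* Models pte_unmap(): closes the section opened by a successful map. */
static void model_pte_unmap(int *pte)
{
    (void)pte;
    rcu_depth--;                /* models rcu_read_unlock() */
}

/* Balanced caller: every successful map is paired with an unmap. */
static int read_entry_balanced(int *table, size_t idx)
{
    int *pte = model_pte_offset_map(table, idx);
    int val;

    if (!pte)
        return -1;
    val = *pte;
    model_pte_unmap(pte);
    return val;
}

/* Buggy caller: the early return skips the unmap and leaks one level. */
static int read_entry_leaky(int *table, size_t idx)
{
    int *pte = model_pte_offset_map(table, idx);
    int val;

    if (!pte)
        return -1;
    if (*pte == 0)
        return 0;               /* BUG: missing model_pte_unmap(pte) */
    val = *pte;
    model_pte_unmap(pte);
    return val;
}
```

      In the model, read_entry_leaky() leaves rcu_depth at 1 whenever it
      takes the early return; before the lock was added to the map helper,
      such a skipped unmap on a config without highmem would have had no
      visible effect, which is why the change can expose latent imbalances.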
      
      Link: https://lkml.kernel.org/r/7cd843a9-aa80-14f-5eb2-33427363c20@google.com
      Link: https://lkml.kernel.org/r/d3b01da5-2a6-833c-6681-67a3e024a16f@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang, Ying <ying.huang@intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lorenzo Stoakes <lstoakes@gmail.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <song@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zack Rusin <zackr@vmware.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a349d72f