  1. Jun 19, 2021
  2. Jun 18, 2021
    • tracing: Do no increment trace_clock_global() by one · 89529d8b
      Steven Rostedt (VMware) authored
      The trace_clock_global() tries to make sure the events between CPUs are
      somewhat in order. A global value is used and updated by the latest read
      of a clock. If one CPU is ahead by a little, and is read by another CPU, a
      lock is taken, and if the timestamp of the other CPU is behind, it will
      simply use the other CPU's timestamp.
      
      The lock is also only taken with a "trylock" due to tracing, and strange
      recursions can happen. The lock is not taken at all in NMI context.
      
      In the case where the lock is not able to be taken, the non synced
      timestamp is returned. But it will not be less than the saved global
      timestamp.
      
      The problem arises because when the time goes "backwards" the time
      returned is the saved timestamp plus 1. If the lock is not taken, and the
      plus one to the timestamp is returned, there's a small race that can cause
      the time to go backwards!
      
      	CPU0				CPU1
      	----				----
      				trace_clock_global() {
      				    ts = clock() [ 1000 ]
      				    trylock(clock_lock) [ success ]
      				    global_ts = ts; [ 1000 ]
      
      				    <interrupted by NMI>
       trace_clock_global() {
          ts = clock() [ 999 ]
          if (ts < global_ts)
      	ts = global_ts + 1 [ 1001 ]
      
          trylock(clock_lock) [ fail ]
      
          return ts [ 1001]
       }
      				    unlock(clock_lock);
      				    return ts; [ 1000 ]
      				}
      
       trace_clock_global() {
          ts = clock() [ 1000 ]
          if (ts < global_ts) [ false 1000 == 1000 ]
      
          trylock(clock_lock) [ success ]
          global_ts = ts; [ 1000 ]
          unlock(clock_lock)
      
          return ts; [ 1000 ]
       }
      
      The above case shows two reads of trace_clock_global() on the same CPU, but
      the second read returns one less than the first read. That is, time went
      backwards, and this is not what is allowed by trace_clock_global().
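      The interleaving above can be replayed in a small single-threaded
      user-space model. This is only a sketch under stated assumptions:
      struct clock_model, model_read() and replay_goes_backwards() are
      hypothetical names, the lock_held flag stands in for the trylock, and
      the bump parameter selects between the old "plus one" behavior and the
      no-increment behavior that the commit title describes. It is not the
      kernel implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the shared state used by trace_clock_global(). */
struct clock_model {
	uint64_t global_ts;	/* last published timestamp */
	bool lock_held;		/* true while another context holds the lock */
};

/*
 * One modelled read. 'raw' is what the local clock returned.
 * 'bump' selects the old behavior (saved timestamp plus one) when the
 * local clock is behind the saved global timestamp.
 */
static uint64_t model_read(struct clock_model *m, uint64_t raw, bool bump)
{
	uint64_t ts = raw;

	if (ts < m->global_ts)
		ts = m->global_ts + (bump ? 1 : 0);

	/* trylock: only publish the timestamp if nobody holds the lock */
	if (!m->lock_held)
		m->global_ts = ts;

	return ts;
}

/* Replays the diagram; returns true if the second read is behind the first. */
static bool replay_goes_backwards(bool bump)
{
	struct clock_model m = { .global_ts = 0, .lock_held = false };
	uint64_t first, second;

	/* Other context reads 1000, publishes it, and is interrupted
	 * while still holding the lock. */
	(void)model_read(&m, 1000, bump);
	m.lock_held = true;

	/* Reader sees a local clock of 999 and fails the trylock. */
	first = model_read(&m, 999, bump);

	/* Other context unlocks and returns 1000. */
	m.lock_held = false;

	/* Reader reads again; the local clock now says 1000. */
	second = model_read(&m, 1000, bump);

	return second < first;
}
```

      With bump set, the model returns 1001 and then 1000 for the two reads,
      matching the off-by-one seen in the ring buffer checker output. Without
      it, both reads return 1000.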
      
      This was triggered by heavy tracing and the ring buffer checker that tests
      for the clock going backwards:
      
       Ring buffer clock went backwards: 20613921464 -> 20613921463
       ------------[ cut here ]------------
       WARNING: CPU: 2 PID: 0 at kernel/trace/ring_buffer.c:3412 check_buffer+0x1b9/0x1c0
       Modules linked in:
       [..]
       [CPU: 2]TIME DOES NOT MATCH expected:20620711698 actual:20620711697 delta:6790234 before:20613921463 after:20613921463
         [20613915818] PAGE TIME STAMP
         [20613915818] delta:0
         [20613915819] delta:1
         [20613916035] delta:216
         [20613916465] delta:430
         [20613916575] delta:110
         [20613916749] delta:174
         [20613917248] delta:499
         [20613917333] delta:85
         [20613917775] delta:442
         [20613917921] delta:146
         [20613918321] delta:400
         [20613918568] delta:247
         [20613918768] delta:200
         [20613919306] delta:538
         [20613919353] delta:47
         [20613919980] delta:627
         [20613920296] delta:316
         [20613920571] delta:275
         [20613920862] delta:291
         [20613921152] delta:290
         [20613921464] delta:312
         [20613921464] delta:0 TIME EXTEND
         [20613921464] delta:0
      
      This happened more than once, and always for an off by one result. It also
      started happening after commit aafe104a was added.
      
      Cc: stable@vger.kernel.org
      Fixes: aafe104a ("tracing: Restructure trace_clock_global() to never block")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Do not stop recording comms if the trace file is being read · 4fdd595e
      Steven Rostedt (VMware) authored
      A while ago, when the "trace" file was opened, tracing was stopped, and
      code was added to stop recording the comms to saved_cmdlines, for mapping
      of the pids to the task name.
      
      Code has been added that only records the comm if a trace event occurred,
      and there's no reason to not trace it if the trace file is opened.
      
      Cc: stable@vger.kernel.org
      Fixes: 7ffbd48d ("tracing: Cache comms only after an event occurred")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Do not stop recording cmdlines when tracing is off · 85550c83
      Steven Rostedt (VMware) authored
      The saved_cmdlines is used to map pids to the task name, such that the
      output of the tracing does not just show pids, but also gives a human
      readable name for the task.
      
      If the name is not mapped, the output looks like this:
      
          <...>-1316          [005] ...2   132.044039: ...
      
      Instead of this:
      
          gnome-shell-1316    [005] ...2   132.044039: ...
      
      The names are updated when tracing is running, but are skipped if tracing
      is stopped. Unfortunately, this stops the recording of the names whenever
      the top level tracer is stopped, even if other tracers are still active.
      
      The recording of a name only happens when a new event is written into a
      ring buffer, so there is no need to test if tracing is on or not. If
      tracing is off, then no event is written, and there is no need to test if
      tracing is off or not.
      
      Remove the check, as it hides the names of tasks for events in the
      instance buffers.
      
      Cc: stable@vger.kernel.org
      Fixes: 7ffbd48d ("tracing: Cache comms only after an event occurred")
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • recordmcount: Correct st_shndx handling · fb780761
      Peter Zijlstra authored
      One should only use st_shndx when >SHN_UNDEF and <SHN_LORESERVE. When
      SHN_XINDEX, then use .symtab_shndx. Otherwise use 0.
      
      This handles the case: st_shndx >= SHN_LORESERVE && st_shndx != SHN_XINDEX.
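      The rule above can be sketched as a small helper. This is a minimal
      illustration, assuming a Linux <elf.h> and host-endian data:
      sym_section_index() and its parameters are hypothetical names, and the
      endianness conversion of sym->st_shndx that the real tool needs is left
      out.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: resolve the section index of the symbol at
 * position 'idx' in the symbol table. 'symtab_shndx' is the content of
 * the .symtab_shndx section (may be NULL if the file has none).
 */
static uint32_t sym_section_index(const Elf64_Sym *sym,
				  const Elf64_Word *symtab_shndx,
				  size_t idx)
{
	uint16_t shndx = sym->st_shndx;

	/* Ordinary case: st_shndx holds the section index directly. */
	if (shndx > SHN_UNDEF && shndx < SHN_LORESERVE)
		return shndx;

	/* Escape value: the real index lives in .symtab_shndx. */
	if (shndx == SHN_XINDEX && symtab_shndx)
		return symtab_shndx[idx];

	/* SHN_UNDEF and the other reserved values (SHN_ABS, ...) map to 0. */
	return 0;
}
```

      A reserved value such as SHN_ABS falls through to the final return,
      which is the st_shndx >= SHN_LORESERVE && st_shndx != SHN_XINDEX case
      mentioned above.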
      
      Link: https://lore.kernel.org/lkml/20210607023839.26387-1-mark-pk.tsai@mediatek.com/
      Link: https://lkml.kernel.org/r/20210616154126.2794-1-mark-pk.tsai@mediatek.com
      
      
      
      Reported-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
      Tested-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      [handle endianness of sym->st_shndx]
      Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • Merge tag 'amd-drm-fixes-5.13-2021-06-16' of... · c55338d3
      Dave Airlie authored
      Merge tag 'amd-drm-fixes-5.13-2021-06-16' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
      
      amd-drm-fixes-5.13-2021-06-16:
      
      amdgpu:
      - GFX9 and 10 powergating fixes
      
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      From: Alex Deucher <alexander.deucher@amd.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20210616204913.4368-1-alexander.deucher@amd.com
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · fd0aa1a4
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "Miscellaneous bugfixes.
      
        The main interesting one is a NULL pointer dereference reported by
        syzkaller ("KVM: x86: Immediately reset the MMU context when the SMM
        flag is cleared")"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: selftests: Fix kvm_check_cap() assertion
        KVM: x86/mmu: Calculate and check "full" mmu_role for nested MMU
        KVM: X86: Fix x86_emulator slab cache leak
        KVM: SVM: Call SEV Guest Decommission if ASID binding fails
        KVM: x86: Immediately reset the MMU context when the SMM flag is cleared
        KVM: x86: Fix fall-through warnings for Clang
        KVM: SVM: fix doc warnings
        KVM: selftests: Fix compiling errors when initializing the static structure
        kvm: LAPIC: Restore guard to prevent illegal APIC register access
    • KVM: selftests: Fix kvm_check_cap() assertion · d8ac05ea
      Fuad Tabba authored
      
      
      KVM_CHECK_EXTENSION ioctl can return any negative value on error,
      and not necessarily -1. Change the assertion to reflect that.
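      The difference between the two checks can be sketched as follows. The
      helper names are hypothetical, and the "old" predicate is an assumption
      inferred from the description above (a test against exactly -1), not a
      quote of the selftest source.

```c
#include <stdbool.h>

/* Assumed old-style check: only -1 is treated as an error. */
static bool check_cap_ok_old(int ret)
{
	return ret != -1;
}

/* Check matching the description: any negative value is an error. */
static bool check_cap_ok_new(int ret)
{
	return ret >= 0;
}
```

      For a return value such as -22, check_cap_ok_old() reports success
      while check_cap_ok_new() reports failure. Both accept 0 and positive
      values.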
      
      Signed-off-by: Fuad Tabba <tabba@google.com>
      Message-Id: <20210615150443.1183365-1-tabba@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Merge tag 'fixes_for_v5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs · 39519f6a
      Linus Torvalds authored
      Pull quota and fanotify fixes from Jan Kara:
       "A fixup finishing disabling of quotactl_path() syscall (I've missed
        archs using different way to declare syscalls) and a fix of an fd leak
        in error handling path of fanotify"
      
      * tag 'fixes_for_v5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
        quota: finish disable quotactl_path syscall
        fanotify: fix copy_event_to_user() fid error clean up
  3. Jun 17, 2021
    • usb: core: hub: Disable autosuspend for Cypress CY7C65632 · a7d8d1c7
      Andrew Lunn authored
      The Cypress CY7C65632 appears to have an issue with auto suspend and
      detecting devices, not too dissimilar to the SMSC 5534B hub. It is
      easiest to reproduce by connecting multiple mass storage devices to
      the hub at the same time. On a Lenovo Yoga, around 1 in 3 attempts
      result in the devices not being detected. It is however possible to
      make them appear using lsusb -v.
      
      Disabling autosuspend for this hub resolves the issue.
      
      Fixes: 1208f9e1 ("USB: hub: Fix the broken detection of USB3 device in SMSC hub")
      Cc: stable@vger.kernel.org
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Link: https://lore.kernel.org/r/20210614155524.2228800-1-andrew@lunn.ch
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • drm/amdgpu/gfx10: enlarge CP_MEC_DOORBELL_RANGE_UPPER to cover full doorbell. · 1c0b0efd
      Yifan Zhang authored
      
      
      If GC has entered CGPG, ringing a doorbell beyond the first page doesn't
      wake up GC. Enlarge CP_MEC_DOORBELL_RANGE_UPPER to work around this issue.
      
      Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Cc: stable@vger.kernel.org
    • drm/amdgpu/gfx9: fix the doorbell missing when in CGPG issue. · 4cbbe348
      Yifan Zhang authored
      
      
      If GC has entered CGPG, ringing a doorbell beyond the first page doesn't
      wake up GC. Enlarge CP_MEC_DOORBELL_RANGE_UPPER to work around this issue.
      
      Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Cc: stable@vger.kernel.org
    • Merge branch 'akpm' (patches from Andrew) · 70585216
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "18 patches.
      
        Subsystems affected by this patch series: mm (memory-failure, swap,
        slub, hugetlb, memory-failure, slub, thp, sparsemem), and coredump"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm/sparse: fix check_usemap_section_nr warnings
        mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split
        mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page()
        mm/thp: fix page_address_in_vma() on file THP tails
        mm/thp: fix vma_address() if virtual address below file offset
        mm/thp: try_to_unmap() use TTU_SYNC for safe splitting
        mm/thp: make is_huge_zero_pmd() safe and quicker
        mm/thp: fix __split_huge_pmd_locked() on shmem migration entry
        mm, thp: use head page in __migration_entry_wait()
        mm/slub.c: include swab.h
        crash_core, vmcoreinfo: append 'SECTION_SIZE_BITS' to vmcoreinfo
        mm/memory-failure: make sure wait for page writeback in memory_failure
        mm/hugetlb: expand restore_reserve_on_error functionality
        mm/slub: actually fix freelist pointer vs redzoning
        mm/slub: fix redzoning for small allocations
        mm/slub: clarify verification reporting
        mm/swap: fix pte_same_as_swp() not removing uffd-wp bit when compare
        mm,hwpoison: fix race with hugetlb page allocation
    • mm/sparse: fix check_usemap_section_nr warnings · ccbd6283
      Miles Chen authored
      I see a "virt_to_phys used for non-linear address" warning from
      check_usemap_section_nr() on arm64 platforms.
      
      In current implementation of NODE_DATA, if CONFIG_NEED_MULTIPLE_NODES=y,
      pglist_data is dynamically allocated and assigned to node_data[].
      
      For example, in arch/arm64/include/asm/mmzone.h:
      
        extern struct pglist_data *node_data[];
        #define NODE_DATA(nid)          (node_data[(nid)])
      
      If CONFIG_NEED_MULTIPLE_NODES=n, pglist_data is defined as a global
      variable named "contig_page_data".
      
      For example, in include/linux/mmzone.h:
      
        extern struct pglist_data contig_page_data;
        #define NODE_DATA(nid)          (&contig_page_data)
      
      If CONFIG_DEBUG_VIRTUAL is not enabled, __pa() can handle both
      dynamically allocated linear addresses and symbol addresses.  However,
      if (CONFIG_DEBUG_VIRTUAL=y && CONFIG_NEED_MULTIPLE_NODES=n) we can see
      the "virt_to_phys used for non-linear address" warning because
      &contig_page_data is not a linear address on arm64.
      
      Warning message:
      
        virt_to_phys used for non-linear address: (contig_page_data+0x0/0x1c00)
        WARNING: CPU: 0 PID: 0 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x58/0x68
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Tainted: G        W         5.13.0-rc1-00074-g1140ab592e2e #3
        Hardware name: linux,dummy-virt (DT)
        pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--)
        Call trace:
           __virt_to_phys+0x58/0x68
           check_usemap_section_nr+0x50/0xfc
           sparse_init_nid+0x1ac/0x28c
           sparse_init+0x1c4/0x1e0
           bootmem_init+0x60/0x90
           setup_arch+0x184/0x1f0
           start_kernel+0x78/0x488
      
      To fix it, create a small function to handle both translations.
      
      Link: https://lkml.kernel.org/r/1623058729-27264-1-git-send-email-miles.chen@mediatek.com
      
      
      Signed-off-by: Miles Chen <miles.chen@mediatek.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Kazu <k-hagio-ab@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: replace DEBUG_VM BUG with VM_WARN when unmap fails for split · 504e070d
      Yang Shi authored
      While debugging the bug reported by Wang Yugui [1], it became clear that
      try_to_unmap() may fail, but the first VM_BUG_ON_PAGE() only checks
      page_mapcount(), so it may miss the failure when the head page is
      unmapped but another subpage is still mapped.  The second DEBUG_VM BUG(),
      which checks the total mapcount, would then catch it.  This may cause
      some confusion.
      
      As this is not a fatal issue, consolidate the two DEBUG_VM checks
      into one VM_WARN_ON_ONCE_PAGE().
      
      [1] https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
      
      Link: https://lkml.kernel.org/r/d0f0db68-98b8-ebfb-16dc-f29df24cf012@google.com
      
      
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page() · 22061a1f
      Hugh Dickins authored
      There is a race between THP unmapping and truncation, when truncate sees
      pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared
      it, but before its page_remove_rmap() gets to decrement
      compound_mapcount: generating false "BUG: Bad page cache" reports that
      the page is still mapped when deleted.  This commit fixes that, but not
      in the way I hoped.
      
      The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
      instead of unmap_mapping_range() in truncate_cleanup_page(): it has
      often been an annoyance that we usually call unmap_mapping_range() with
      no pages locked, but there apply it to a single locked page.
      try_to_unmap() looks more suitable for a single locked page.
      
      However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
      it is used to insert THP migration entries, but not used to unmap THPs.
      Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB
      needs are different, I'm too ignorant of the DAX cases, and couldn't
      decide how far to go for anon+swap.  Set that aside.
      
      The second attempt took a different tack: make no change in truncate.c,
      but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
      clearing it initially, then pmd_clear() between page_remove_rmap() and
      unlocking at the end.  Nice.  But powerpc blows that approach out of the
      water, with its serialize_against_pte_lookup(), and interesting pgtable
      usage.  It would need serious help to get working on powerpc (with a
      minor optimization issue on s390 too).  Set that aside.
      
      Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
      delay, after unmapping in truncate_cleanup_page()? Perhaps, but though
      that's likely to reduce or eliminate the number of incidents, it would
      give less assurance of whether we had identified the problem correctly.
      
      This successful iteration introduces "unmap_mapping_page(page)" instead
      of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
      with an addition to details.  Then zap_pmd_range() watches for this
      case, and does spin_unlock(pmd_lock) if so - just like
      page_vma_mapped_walk() now does in the PVMW_SYNC case.  Not pretty, but
      safe.
      
      Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
      assert its interface; but currently that's only used to make sure that
      page->mapping is stable, and zap_pmd_range() doesn't care if the page is
      locked or not.  Along these lines, in invalidate_inode_pages2_range()
      move the initial unmap_mapping_range() out from under page lock, before
      then calling unmap_mapping_page() under page lock if still mapped.
      
      Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
      Fixes: fc127da0 ("truncate: handle file thp")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: fix page_address_in_vma() on file THP tails · 31657170
      Jue Wang authored
      Anon THP tails were already supported, but memory-failure may need to
      use page_address_in_vma() on file THP tails, which its page->mapping
      check did not permit: fix it.
      
      hughd adds: no current usage is known to hit the issue, but this does
      fix a subtle trap in a general helper: best fixed in stable sooner than
      later.
      
      Link: https://lkml.kernel.org/r/a0d9b53-bf5d-8bab-ac5-759dc61819c1@google.com
      Fixes: 800d8c63 ("shmem: add huge pages support")
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: fix vma_address() if virtual address below file offset · 494334e4
      Hugh Dickins authored
      Running certain tests with a DEBUG_VM kernel would crash within hours,
      on the total_mapcount BUG() in split_huge_page_to_list(), while trying
      to free up some memory by punching a hole in a shmem huge page: split's
      try_to_unmap() was unable to find all the mappings of the page (which,
      on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).
      
      When that BUG() was changed to a WARN(), it would later crash on the
      VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma) in
      mm/internal.h:vma_address(), used by rmap_walk_file() for
      try_to_unmap().
      
      vma_address() is usually correct, but there's a wraparound case when the
      vm_start address is unusually low, but vm_pgoff not so low:
      vma_address() chooses max(start, vma->vm_start), but that decides on the
      wrong address, because start has become almost ULONG_MAX.
      
      Rewrite vma_address() to be more careful about vm_pgoff; move the
      VM_BUG_ON_VMA() out of it, returning -EFAULT for errors, so that it can
      be safely used from page_mapped_in_vma() and page_address_in_vma() too.
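      The wraparound can be shown with plain 64-bit arithmetic in user space.
      The sketch below is a simplified model and not the kernel code:
      struct vma_model, vma_address_naive(), vma_address_careful() and
      MODEL_PAGE_SHIFT are hypothetical names, a 4KiB page size is assumed,
      and the page is described only by its file offset in pages (pgoff) and
      its number of pages.

```c
#include <errno.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumed 4KiB pages */

/* Hypothetical stand-in for the vma fields used in the calculation. */
struct vma_model {
	uint64_t vm_start;	/* first virtual address of the mapping */
	uint64_t vm_end;	/* first address past the mapping */
	uint64_t vm_pgoff;	/* file offset of vm_start, in pages */
};

/* Unconditional unsigned subtraction followed by max(start, vm_start). */
static uint64_t vma_address_naive(const struct vma_model *vma, uint64_t pgoff)
{
	uint64_t start = vma->vm_start +
			 ((pgoff - vma->vm_pgoff) << MODEL_PAGE_SHIFT);

	return start > vma->vm_start ? start : vma->vm_start;
}

/* Only subtract when pgoff >= vm_pgoff; report errors as -EFAULT. */
static uint64_t vma_address_careful(const struct vma_model *vma,
				    uint64_t pgoff, uint64_t nr_pages)
{
	uint64_t address;

	if (pgoff >= vma->vm_pgoff) {
		address = vma->vm_start +
			  ((pgoff - vma->vm_pgoff) << MODEL_PAGE_SHIFT);
		/* Beyond the vma, or wrapped through 0. */
		if (address < vma->vm_start || address >= vma->vm_end)
			address = (uint64_t)-EFAULT;
	} else if (pgoff + nr_pages - 1 >= vma->vm_pgoff) {
		/* Page starts before the vma but its tail reaches into it. */
		address = vma->vm_start;
	} else {
		address = (uint64_t)-EFAULT;
	}
	return address;
}
```

      With vm_start = 0x1000, vm_pgoff = 0x10 and a 512-page huge page at
      pgoff 0, the naive version computes 0x1000 - 0x10000, which wraps to
      0xffffffffffff1000 and wins the max(), while the careful version
      returns vm_start.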
      
      Add vma_address_end() to apply similar care to end address calculation,
      in page_vma_mapped_walk() and page_mkclean_one() and try_to_unmap_one();
      though it raises a question of whether callers would do better to supply
      pvmw->end to page_vma_mapped_walk() - I chose not, for a smaller patch.
      
      An irritation is that their apparent generality breaks down on KSM
      pages, which cannot be located by the page->index that page_to_pgoff()
      uses: as commit 4b0ece6f ("mm: migrate: fix remove_migration_pte()
      for ksm pages") once discovered.  I dithered over the best thing to do
      about that, and have ended up with a VM_BUG_ON_PAGE(PageKsm) in both
      vma_address() and vma_address_end(); though the only place in danger of
      using it on them was try_to_unmap_one().
      
      Sidenote: vma_address() and vma_address_end() now use compound_nr() on a
      head page, instead of thp_size(): to make the right calculation on a
      hugetlbfs page, whether or not THPs are configured.  try_to_unmap() is
      used on hugetlbfs pages, but perhaps the wrong calculation never
      mattered.
      
      Link: https://lkml.kernel.org/r/caf1c1a3-7cfb-7f8f-1beb-ba816e932825@google.com
      Fixes: a8fa41ad ("mm, rmap: check all VMAs that PTE-mapped THP can be part of")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: try_to_unmap() use TTU_SYNC for safe splitting · 732ed558
      Hugh Dickins authored
      Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
      (!unmap_success): with dump_page() showing mapcount:1, but then its raw
      struct page output showing _mapcount ffffffff i.e.  mapcount 0.
      
      And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed,
      it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)),
      and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG():
      all indicative of some mapcount difficulty in development here perhaps.
      But the !CONFIG_DEBUG_VM path handles the failures correctly and
      silently.
      
      I believe the problem is that once a racing unmap has cleared pte or
      pmd, try_to_unmap_one() may skip taking the page table lock, and emerge
      from try_to_unmap() before the racing task has reached decrementing
      mapcount.
      
      Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
      follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding
      TTU_SYNC to the options, and passing that from unmap_page().
      
      When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same
      for both: the slight overhead added should rarely matter, except perhaps
      if splitting sparsely-populated multiply-mapped shmem.  Once confident
      that bugs are fixed, TTU_SYNC here can be removed, and the race
      tolerated.
      
      Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com
      Fixes: fec89c10 ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: make is_huge_zero_pmd() safe and quicker · 3b77e8c8
      Hugh Dickins authored
      Most callers of is_huge_zero_pmd() supply a pmd already verified
      present; but a few (notably zap_huge_pmd()) do not - it might be a pmd
      migration entry, in which the pfn is encoded differently from a present
      pmd: which might pass the is_huge_zero_pmd() test (though not on x86,
      since L1TF forced us to protect against that); or perhaps even crash in
      pmd_page() applied to a swap-like entry.
      
      Make it safe by adding pmd_present() check into is_huge_zero_pmd()
      itself; and make it quicker by saving huge_zero_pfn, so that
      is_huge_zero_pmd() will not need to do that pmd_page() lookup each time.
      
      __split_huge_pmd_locked() checked pmd_trans_huge() before: that worked,
      but is unnecessary now that is_huge_zero_pmd() checks present.
      
      Link: https://lkml.kernel.org/r/21ea9ca-a1f5-8b90-5e88-95fb1c49bbfa@google.com
      Fixes: e71769ae ("mm: enable thp migration for shmem thp")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3b77e8c8
    • Hugh Dickins's avatar
      mm/thp: fix __split_huge_pmd_locked() on shmem migration entry · 99fa8a48
      Hugh Dickins authored
      Patch series "mm/thp: fix THP splitting unmap BUGs and related", v10.
      
      Here is v2 batch of long-standing THP bug fixes that I had not got
      around to sending before, but prompted now by Wang Yugui's report
      https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
      
      Wang Yugui has tested a rollup of these fixes applied to 5.10.39, and
      they have done no harm, but have *not* fixed that issue: something more
      is needed and I have no idea of what.
      
      This patch (of 7):
      
      Stressing huge tmpfs page migration racing hole punch often crashed on
      the VM_BUG_ON(!pmd_present) in pmdp_huge_clear_flush(), with DEBUG_VM=y
      kernel; or shortly afterwards, on a bad dereference in
      __split_huge_pmd_locked() when DEBUG_VM=n.  They forgot to allow for pmd
      migration entries in the non-anonymous case.
      
      Full disclosure: those particular experiments were on a kernel with more
      relaxed mmap_lock and i_mmap_rwsem locking, and were not repeated on the
      vanilla kernel: it is conceivable that stricter locking happens to avoid
      those cases, or makes them less likely; but __split_huge_pmd_locked()
      already allowed for pmd migration entries when handling anonymous THPs,
      so this commit brings the shmem and file THP handling into line.
      
      And while there: use old_pmd rather than _pmd, as in the following
      blocks; and make it clearer to the eye that the !vma_is_anonymous()
      block is self-contained, making an early return after accounting for
      unmapping.
      
      Link: https://lkml.kernel.org/r/af88612-1473-2eaa-903-8d1a448b26@google.com
      Link: https://lkml.kernel.org/r/dd221a99-efb3-cd1d-6256-7e646af29314@google.com
      Fixes: e71769ae ("mm: enable thp migration for shmem thp")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jue Wang <juew@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99fa8a48
    • Xu Yu's avatar
      mm, thp: use head page in __migration_entry_wait() · ffc90cbb
      Xu Yu authored
      We noticed that a hung task can occur in a corner-case but practical
      scenario when CONFIG_PREEMPT_NONE is enabled, as follows.
      
      Process 0                       Process 1                     Process 2..Inf
      split_huge_page_to_list
          unmap_page
              split_huge_pmd_address
                                      __migration_entry_wait(head)
                                                                    __migration_entry_wait(tail)
          remap_page (roll back)
              remove_migration_ptes
                  rmap_walk_anon
                      cond_resched
      
      Here __migration_entry_wait(tail) occurs in kernel space, e.g. in
      copy_to_user in fstat, which will immediately fault again without
      rescheduling, and thus fully occupy the CPU.
      
      When there are too many processes performing __migration_entry_wait on
      a tail page, remap_page will never complete after cond_resched.
      
      This patch makes __migration_entry_wait operate on the compound head
      page, so that it waits for remap_page to complete, whether the THP is
      split successfully or rolled back.
      
      Note that put_and_wait_on_page_locked helps to drop the page reference
      acquired with get_page_unless_zero, as soon as the page is on the wait
      queue, before actually waiting.  So splitting the THP is only prevented
      for a brief interval.
      
      Link: https://lkml.kernel.org/r/b9836c1dd522e903891760af9f0c86a2cce987eb.1623144009.git.xuyu@linux.alibaba.com
      Fixes: ba988280 ("thp: add option to setup migration entries during PMD split")
      Suggested-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ffc90cbb
    • Andrew Morton's avatar
      mm/slub.c: include swab.h · 1b3865d0
      Andrew Morton authored
      Fixes build with CONFIG_SLAB_FREELIST_HARDENED=y.
      
      Hopefully.  But it's the right thing to do anyway.
      
      Fixes: 1ad53d9f ("slub: improve bit diffusion for freelist ptr obfuscation")
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=213417
      Reported-by: <vannguye@cisco.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b3865d0
    • Pingfan Liu's avatar
      crash_core, vmcoreinfo: append 'SECTION_SIZE_BITS' to vmcoreinfo · 4f5aecdf
      Pingfan Liu authored
      As mentioned in kernel commit 1d50e5d0 ("crash_core, vmcoreinfo:
      Append 'MAX_PHYSMEM_BITS' to vmcoreinfo"), SECTION_SIZE_BITS appears in
      the formula:
      
          #define SECTIONS_SHIFT    (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
      
      Besides SECTIONS_SHIFT, SECTION_SIZE_BITS is also used to calculate
      PAGES_PER_SECTION in makedumpfile, just as in the kernel.
      
      Unfortunately, the arch-dependent macro SECTION_SIZE_BITS can change, as
      it did recently in kernel commit f0b13ee2 ("arm64/sparsemem: reduce
      SECTION_SIZE_BITS").  User space wants a stable interface to get this
      value, and it cannot be deduced from a crashdump vmcore.  Hence append
      SECTION_SIZE_BITS to vmcoreinfo.
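
      As a rough illustration of why a dump tool needs the value, the two
      derived quantities can be computed as below.  This is a sketch only: the
      function names are hypothetical, and the inputs are passed as parameters
      because the real values are architecture- and configuration-dependent.

```c
#include <assert.h>

/* SECTIONS_SHIFT = MAX_PHYSMEM_BITS - SECTION_SIZE_BITS */
static unsigned int toy_sections_shift(unsigned int max_physmem_bits,
				       unsigned int section_size_bits)
{
	assert(max_physmem_bits >= section_size_bits);
	return max_physmem_bits - section_size_bits;
}

/* PAGES_PER_SECTION = 1 << (SECTION_SIZE_BITS - PAGE_SHIFT) */
static unsigned long toy_pages_per_section(unsigned int section_size_bits,
					   unsigned int page_shift)
{
	assert(section_size_bits >= page_shift);
	return 1UL << (section_size_bits - page_shift);
}
```

      With illustrative inputs of 46 physical address bits, 27 section size
      bits and a page shift of 12, this gives a sections shift of 19 and 32768
      pages per section.  A tool that guessed the wrong SECTION_SIZE_BITS
      would get both numbers wrong.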
      
      Link: https://lkml.kernel.org/r/20210608103359.84907-1-kernelfans@gmail.com
      Link: http://lists.infradead.org/pipermail/kexec/2021-June/022676.html
      Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
      Acked-by: Baoquan He <bhe@redhat.com>
      Cc: Bhupesh Sharma <bhupesh.sharma@linaro.org>
      Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Boris Petkov <bp@alien8.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: James Morse <james.morse@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Anderson <anderson@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f5aecdf
    • yangerkun's avatar
      mm/memory-failure: make sure wait for page writeback in memory_failure · e8675d29
      yangerkun authored
      Our syzkaller run triggered the "BUG_ON(!list_empty(&inode->i_wb_list))"
      in clear_inode:
      
        kernel BUG at fs/inode.c:519!
        Internal error: Oops - BUG: 0 [#1] SMP
        Modules linked in:
        Process syz-executor.0 (pid: 249, stack limit = 0x00000000a12409d7)
        CPU: 1 PID: 249 Comm: syz-executor.0 Not tainted 4.19.95
        Hardware name: linux,dummy-virt (DT)
        pstate: 80000005 (Nzcv daif -PAN -UAO)
        pc : clear_inode+0x280/0x2a8
        lr : clear_inode+0x280/0x2a8
        Call trace:
          clear_inode+0x280/0x2a8
          ext4_clear_inode+0x38/0xe8
          ext4_free_inode+0x130/0xc68
          ext4_evict_inode+0xb20/0xcb8
          evict+0x1a8/0x3c0
          iput+0x344/0x460
          do_unlinkat+0x260/0x410
          __arm64_sys_unlinkat+0x6c/0xc0
          el0_svc_common+0xdc/0x3b0
          el0_svc_handler+0xf8/0x160
          el0_svc+0x10/0x218
        Kernel panic - not syncing: Fatal exception
      
      A crash dump of this problem shows that someone called __munlock_pagevec
      to clear the page LRU flag without lock_page: do_mmap -> mmap_region ->
      do_munmap -> munlock_vma_pages_range -> __munlock_pagevec.
      
      As a result, memory_failure will call identify_page_state without
      wait_on_page_writeback.  Then, after truncate_error_page clears the
      mapping of this page, end_page_writeback won't call
      sb_clear_inode_writeback to clear inode->i_wb_list.  That will trigger
      the BUG_ON in clear_inode!
      
      Fix it by also checking PageWriteback when deciding whether to skip
      wait_on_page_writeback.
      
      Link: https://lkml.kernel.org/r/20210604084705.3729204-1-yangerkun@huawei.com
      Fixes: 0bc1f8b0 ("hwpoison: fix the handling path of the victimized page frame that belong to non-LRU")
      Signed-off-by: yangerkun <yangerkun@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e8675d29
    • Mike Kravetz's avatar
      mm/hugetlb: expand restore_reserve_on_error functionality · 846be085
      Mike Kravetz authored
      The routine restore_reserve_on_error is called to restore reservation
      information when an error occurs after page allocation.  The routine
      alloc_huge_page modifies the mapping reserve map and potentially the
      reserve count during allocation.  If code calling alloc_huge_page
      encounters an error after allocation and needs to free the page, the
      reservation information needs to be adjusted.
      
      Currently, restore_reserve_on_error only takes action on pages for which
      the reserve count was adjusted (HPageRestoreReserve flag).  There is
      nothing wrong with these adjustments.  However, alloc_huge_page ALWAYS
      modifies the reserve map during allocation even if the reserve count is
      not adjusted.  This can cause issues as observed during development of
      this patch [1].
      
      One specific series of operations causing an issue is:
      
       - Create a shared hugetlb mapping
         Reservations for all pages created by default
      
       - Fault in a page in the mapping
         Reservation exists so reservation count is decremented
      
       - Punch a hole in the file/mapping at index previously faulted
         Reservation and any associated pages will be removed
      
       - Allocate a page to fill the hole
         No reservation entry, so reserve count unmodified
         Reservation entry added to map by alloc_huge_page
      
       - Error after allocation and before instantiating the page
         Reservation entry remains in map
      
       - Allocate a page to fill the hole
         Reservation entry exists, so decrement reservation count
      
      This will cause a reservation count underflow as the reservation count
      was decremented twice for the same index.
      
      A user would observe a very large number for HugePages_Rsvd in
      /proc/meminfo.  This would also likely cause subsequent allocations of
      hugetlb pages to fail as it would 'appear' that all pages are reserved.
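
      The sequence above can be replayed in a toy user-space model.  This is a
      sketch under simplifying assumptions: all toy_* names are hypothetical,
      the mapping has a single page, the reserve map is a plain array, and the
      case where the reserve count was consumed (HPageRestoreReserve) is left
      out.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define TOY_NR_PAGES 1

struct toy_hugetlb {
	bool resv_map[TOY_NR_PAGES];	/* true: reserve map has an entry */
	unsigned long resv_count;	/* models HugePages_Rsvd */
};

/* Shared mapping: reservations for all pages are created up front. */
static void toy_mmap_shared(struct toy_hugetlb *h)
{
	for (int i = 0; i < TOY_NR_PAGES; i++)
		h->resv_map[i] = true;
	h->resv_count = TOY_NR_PAGES;
}

/* Returns true if the reserve count was consumed by this allocation. */
static bool toy_alloc_huge_page(struct toy_hugetlb *h, int idx)
{
	assert(idx >= 0 && idx < TOY_NR_PAGES);
	if (h->resv_map[idx]) {
		h->resv_count--;	/* unsigned: wraps around below zero */
		return true;
	}
	h->resv_map[idx] = true;	/* the map is always modified */
	return false;
}

static void toy_punch_hole(struct toy_hugetlb *h, int idx)
{
	h->resv_map[idx] = false;
}

/* Error path: with the fix, drop the map entry the allocation added. */
static void toy_error_after_alloc(struct toy_hugetlb *h, int idx,
				  bool consumed, bool with_fix)
{
	if (!consumed && with_fix)
		h->resv_map[idx] = false;
}

static unsigned long toy_run_sequence(bool with_fix)
{
	struct toy_hugetlb h;
	bool consumed;

	toy_mmap_shared(&h);			/* count = 1 */
	(void)toy_alloc_huge_page(&h, 0);	/* fault: count = 0 */
	toy_punch_hole(&h, 0);			/* entry removed */
	consumed = toy_alloc_huge_page(&h, 0);	/* entry added, count kept */
	toy_error_after_alloc(&h, 0, consumed, with_fix);
	(void)toy_alloc_huge_page(&h, 0);	/* retry after the error */
	return h.resv_count;
}
```

      Without the cleanup the retry finds the stale entry and decrements a
      count that is already zero, so the unsigned counter wraps to ULONG_MAX;
      with the cleanup the count stays at zero.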
      
      This sequence of operations is unlikely to happen; however, it was
      easily reproduced and observed using hacked up code as described in [1].
      
      Address the issue by having the routine restore_reserve_on_error take
      action on pages where HPageRestoreReserve is not set.  In this case, we
      need to remove any reserve map entry created by alloc_huge_page.  A new
      helper routine vma_del_reservation assists with this operation.
      
      There are three callers of alloc_huge_page which do not currently call
      restore_reserve_on_error before freeing a page on error paths.  Add
      those missing calls.
      
      [1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@google.com/
      
      Link: https://lkml.kernel.org/r/20210607204510.22617-1-mike.kravetz@oracle.com
      Fixes: 96b96a96 ("mm/hugetlb: fix huge page reservation leak in private mapping error paths")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mina Almasry <almasrymina@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      846be085
    • Kees Cook's avatar
      mm/slub: actually fix freelist pointer vs redzoning · e41a49fa
      Kees Cook authored
      It turns out that SLUB redzoning ("slub_debug=Z") checks from
      s->object_size rather than from s->inuse (which is normally bumped to
      make room for the freelist pointer), so a cache created with an object
      size less than 24 would have the freelist pointer written beyond
      s->object_size, causing the redzone to be corrupted by the freelist
      pointer.  This was very visible with "slub_debug=ZF":
      
        BUG test (Tainted: G    B            ): Right Redzone overwritten
        -----------------------------------------------------------------------------
      
        INFO: 0xffff957ead1c05de-0xffff957ead1c05df @offset=1502. First byte 0x1a instead of 0xbb
        INFO: Slab 0xffffef3950b47000 objects=170 used=170 fp=0x0000000000000000 flags=0x8000000000000200
        INFO: Object 0xffff957ead1c05d8 @offset=1496 fp=0xffff957ead1c0620
      
        Redzone  (____ptrval____): bb bb bb bb bb bb bb bb               ........
        Object   (____ptrval____): 00 00 00 00 00 f6 f4 a5               ........
        Redzone  (____ptrval____): 40 1d e8 1a aa                        @....
        Padding  (____ptrval____): 00 00 00 00 00 00 00 00               ........
      
      Adjust the offset to stay within s->object_size.
      
      (Note that no caches in this size range are known to exist in the
      kernel currently.)
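
      One way to read the description is that a mid-object freelist pointer
      offset derived from the word-aligned size can end past s->object_size,
      while one derived from object_size itself cannot.  The sketch below
      illustrates that reading; the toy_* helpers and both offset formulas are
      assumptions made for illustration and are not copied from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static size_t toy_round_up(size_t n, size_t a)
{
	return (n + a - 1) / a * a;
}

static size_t toy_round_down(size_t n, size_t a)
{
	return n / a * a;
}

/* Assumed old placement: middle of the word-aligned size, rounded up. */
static size_t toy_fp_offset_old(size_t object_size, size_t word)
{
	size_t aligned = toy_round_up(object_size, word);

	return toy_round_up(aligned / 2, word);
}

/* Assumed new placement: middle of object_size, rounded down. */
static size_t toy_fp_offset_new(size_t object_size, size_t word)
{
	return toy_round_down(object_size / 2, word);
}

/* True if a pointer stored at offset ends within the object itself. */
static bool toy_fp_within_object(size_t offset, size_t object_size,
				 size_t word)
{
	return offset + word <= object_size;
}
```

      For a 20-byte object and 8-byte words, the assumed old formula gives
      offset 16, so the pointer covers bytes 16 to 23 and spills past the
      object; the assumed new formula gives offset 8, which stays inside it.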
      
      Link: https://lkml.kernel.org/r/20210608183955.280836-4-keescook@chromium.org
      Link: https://lore.kernel.org/linux-mm/20200807160627.GA1420741@elver.google.com/
      Link: https://lore.kernel.org/lkml/0f7dd7b2-7496-5e2d-9488-2ec9f8e90441@suse.cz/
      Fixes: 89b83f28 (slub: avoid redzone when choosing freepointer location)
      Link: https://lore.kernel.org/lkml/CANpmjNOwZ5VpKQn+SYWovTkFB4VsT-RPwyENBmaK0dLcpqStkA@mail.gmail.com
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reported-by: Marco Elver <elver@google.com>
      Reported-by: "Lin, Zhenpeng" <zplin@psu.edu>
      Tested-by: Marco Elver <elver@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e41a49fa
    • Kees Cook's avatar
      mm/slub: fix redzoning for small allocations · 74c1d3e0
      Kees Cook authored
      The redzone area for SLUB exists between s->object_size and s->inuse
      (which is at least the word-aligned object_size).  If a cache were
      created with an object_size smaller than sizeof(void *), the in-object
      stored freelist pointer would overwrite the redzone (e.g.  with boot
      param "slub_debug=ZF"):
      
        BUG test (Tainted: G    B            ): Right Redzone overwritten
        -----------------------------------------------------------------------------
      
        INFO: 0xffff957ead1c05de-0xffff957ead1c05df @offset=1502. First byte 0x1a instead of 0xbb
        INFO: Slab 0xffffef3950b47000 objects=170 used=170 fp=0x0000000000000000 flags=0x8000000000000200
        INFO: Object 0xffff957ead1c05d8 @offset=1496 fp=0xffff957ead1c0620
      
        Redzone  (____ptrval____): bb bb bb bb bb bb bb bb    ........
        Object   (____ptrval____): f6 f4 a5 40 1d e8          ...@..
        Redzone  (____ptrval____): 1a aa                      ..
        Padding  (____ptrval____): 00 00 00 00 00 00 00 00    ........
      
      Store the freelist pointer out of line when object_size is smaller than
      sizeof(void *) and redzoning is enabled.
      
      Additionally remove the "smaller than sizeof(void *)" check under
      CONFIG_DEBUG_VM in kmem_cache_sanity_check() as it is now redundant:
      SLAB and SLOB both handle small sizes.
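
      The layout decision described above can be sketched in user space.  The
      struct and toy_* functions are hypothetical, and only the case named in
      the text (redzoning enabled, object_size smaller than a pointer) moves
      the freelist pointer out of line in this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_layout {
	size_t object_size;	/* bytes requested for the object */
	size_t inuse;		/* object_size rounded up to a word */
	size_t fp_offset;	/* where the freelist pointer is stored */
	size_t size;		/* total bytes per object slot */
};

static size_t toy_align_up(size_t n, size_t a)
{
	return (n + a - 1) / a * a;
}

static struct toy_layout toy_calculate_layout(size_t object_size,
					      bool red_zone, bool with_fix)
{
	struct toy_layout l;

	l.object_size = object_size;
	l.inuse = toy_align_up(object_size, sizeof(void *));
	l.size = l.inuse;
	l.fp_offset = 0;	/* default: pointer stored inside the object */

	/* Too small to hold a pointer: store it after the redzone instead. */
	if (with_fix && red_zone && object_size < sizeof(void *)) {
		l.fp_offset = l.size;
		l.size += sizeof(void *);
	}
	return l;
}

/* The redzone in this model is the byte range [object_size, inuse). */
static bool toy_fp_hits_redzone(const struct toy_layout *l)
{
	size_t fp_end = l->fp_offset + sizeof(void *);

	return l->object_size < l->inuse &&
	       l->fp_offset < l->inuse && fp_end > l->object_size;
}
```

      For an object two bytes smaller than a pointer, the in-object pointer
      overlaps the redzone bytes, which is what the report above shows; with
      the out-of-line placement the pointer starts at inuse and the redzone is
      left intact.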
      
      (Note that no caches within this size range are known to exist in the
      kernel currently.)
      
      Link: https://lkml.kernel.org/r/20210608183955.280836-3-keescook@chromium.org
      Fixes: 81819f0f ("SLUB core")
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Lin, Zhenpeng" <zplin@psu.edu>
      Cc: Marco Elver <elver@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      74c1d3e0
    • Kees Cook's avatar
      mm/slub: clarify verification reporting · 8669dbab
      Kees Cook authored
      Patch series "Actually fix freelist pointer vs redzoning", v4.
      
      This fixes redzoning vs the freelist pointer (both for middle-position
      and very small caches).  Both are "theoretical" fixes, in that I see no
      evidence of such small-sized caches actually being used in the kernel, but
      that's no reason to let the bugs continue to exist, especially since
      people doing local development keep tripping over it.  :)
      
      This patch (of 3):
      
      Instead of repeating "Redzone" and "Poison", clarify which sides of
      those zones got tripped.  Additionally fix column alignment in the
      trailer.
      
      Before:
      
        BUG test (Tainted: G    B            ): Redzone overwritten
        ...
        Redzone (____ptrval____): bb bb bb bb bb bb bb bb      ........
        Object (____ptrval____): f6 f4 a5 40 1d e8            ...@..
        Redzone (____ptrval____): 1a aa                        ..
        Padding (____ptrval____): 00 00 00 00 00 00 00 00      ........
      
      After:
      
        BUG test (Tainted: G    B            ): Right Redzone overwritten
        ...
        Redzone  (____ptrval____): bb bb bb bb bb bb bb bb      ........
        Object   (____ptrval____): f6 f4 a5 40 1d e8            ...@..
        Redzone  (____ptrval____): 1a aa                        ..
        Padding  (____ptrval____): 00 00 00 00 00 00 00 00      ........
      
      The earlier commits that slowly resulted in the "Before" reporting were:
      
        d86bd1be ("mm/slub: support left redzone")
        ffc79d28 ("slub: use print_hex_dump")
        24922684 ("SLUB: change error reporting format to follow lockdep loosely")
      
      Link: https://lkml.kernel.org/r/20210608183955.280836-1-keescook@chromium.org
      Link: https://lkml.kernel.org/r/20210608183955.280836-2-keescook@chromium.org
      Link: https://lore.kernel.org/lkml/cfdb11d7-fb8e-e578-c939-f7f5fb69a6bd@suse.cz/
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Marco Elver <elver@google.com>
      Cc: "Lin, Zhenpeng" <zplin@psu.edu>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8669dbab
    • Peter Xu's avatar
      mm/swap: fix pte_same_as_swp() not removing uffd-wp bit when compare · 099dd687
      Peter Xu authored
      I found by pure code review that pte_same_as_swp(), as used by
      unuse_vma(), did not take the uffd-wp bit into account when comparing
      ptes.  A false negative from pte_same_as_swp() could cause swapoff to
      fail on swap ptes that were write-protected by userfaultfd.
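
      The false negative can be shown with a toy swap pte encoding.  This is a
      sketch: the bit positions and toy_* names are hypothetical, and the
      "old" comparison is modelled as clearing only a soft-dirty style flag,
      which is an assumption about the pre-fix behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t toy_pte_t;

/* Hypothetical software bits kept alongside the swap entry. */
#define TOY_SWP_SOFT_DIRTY (1ULL << 0)
#define TOY_SWP_UFFD_WP    (1ULL << 1)
#define TOY_SWP_FLAGS      (TOY_SWP_SOFT_DIRTY | TOY_SWP_UFFD_WP)

/* The pte that swapoff computes from the swap entry: no flag bits set. */
static toy_pte_t toy_mk_swap_pte(uint64_t swap_entry)
{
	return swap_entry << 2;
}

/* Assumed old behaviour: the uffd-wp bit takes part in the comparison. */
static bool toy_pte_same_as_swp_old(toy_pte_t pte, toy_pte_t swp_pte)
{
	return (pte & ~TOY_SWP_SOFT_DIRTY) == swp_pte;
}

/* Fixed behaviour: strip every software flag before comparing. */
static bool toy_pte_same_as_swp(toy_pte_t pte, toy_pte_t swp_pte)
{
	return (pte & ~TOY_SWP_FLAGS) == swp_pte;
}
```

      A pte carrying the uffd-wp bit never compares equal in the old model, so
      a scan looking for that swap entry would skip it; masking the flag first
      makes the comparison depend on the swap entry alone.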
      
      Link: https://lkml.kernel.org/r/20210603180546.9083-1-peterx@redhat.com
      Fixes: f45ec5ff ("userfaultfd: wp: support swap and page migration")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      099dd687
    • Naoya Horiguchi's avatar
      mm,hwpoison: fix race with hugetlb page allocation · 25182f05
      Naoya Horiguchi authored
      When a hugetlb page fault (under an overcommitting situation) races
      with memory_failure(), VM_BUG_ON_PAGE() is triggered by the following
      sequence:
      
          CPU0:                           CPU1:
      
                                          gather_surplus_pages()
                                            page = alloc_surplus_huge_page()
          memory_failure_hugetlb()
            get_hwpoison_page(page)
              __get_hwpoison_page(page)
                get_page_unless_zero(page)
                                            zero = put_page_testzero(page)
                                            VM_BUG_ON_PAGE(!zero, page)
                                            enqueue_huge_page(h, page)
            put_page(page)
      
      __get_hwpoison_page() only checks the page refcount before taking an
      additional one for memory error handling, which is not enough because
      there's a time window where compound pages have non-zero refcount during
      hugetlb page initialization.
      
      So make __get_hwpoison_page() check page status a bit more for hugetlb
      pages with get_hwpoison_huge_page().  Checking hugetlb-specific flags
      under hugetlb_lock makes sure that the hugetlb page is not in a
      transient state.  Notably, another new function, HWPoisonHandlable(),
      helps to prevent a race against other transient page states (like a
      generic compound page just before PageHuge becomes true).
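
      The window can be modelled single-threaded by replaying the interleaving
      from the diagram by hand.  This is a sketch: struct toy_page, the
      hugetlb_ready flag and the toy_* functions are hypothetical stand-ins,
      and the locking that the real check relies on is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page {
	int refcount;
	bool hugetlb_ready;	/* stand-in for the hugetlb-specific flags */
};

/* Take a reference unless the count is already zero. */
static bool toy_get_page_unless_zero(struct toy_page *p)
{
	if (p->refcount == 0)
		return false;
	p->refcount++;
	return true;
}

/* Old behaviour: only the refcount is checked. */
static bool toy_get_hwpoison_page_old(struct toy_page *p)
{
	return toy_get_page_unless_zero(p);
}

/* New behaviour: refuse pages that are still being set up. */
static bool toy_get_hwpoison_page_new(struct toy_page *p)
{
	if (!p->hugetlb_ready)
		return false;
	return toy_get_page_unless_zero(p);
}

/* Allocation side: drop the initial reference and expect to reach zero. */
static bool toy_put_page_testzero(struct toy_page *p)
{
	assert(p->refcount > 0);
	return --p->refcount == 0;
}
```

      With a freshly allocated surplus page (refcount 1, not yet ready), the
      old path takes an extra reference, so the allocator's put does not reach
      zero, which is the condition VM_BUG_ON_PAGE() checks.  The new path
      declines the page and the allocator's put reaches zero as expected.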
      
      Link: https://lkml.kernel.org/r/20210603233632.2964832-2-nao.horiguchi@gmail.com
      Fixes: ead07f6a ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reported-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: <stable@vger.kernel.org>	[5.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25182f05
    • Linus Torvalds's avatar
      Merge tag 'dmaengine-fix-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine · 6b00bc63
      Linus Torvalds authored
      Pull dmaengine fixes from Vinod Koul:
       "A bunch of driver fixes, notably:
      
         - More idxd fixes for driver unregister, error handling and bus
           assignment
      
         - HAS_IOMEM depends fix for few drivers
      
         - lock fix in pl330 driver
      
         - xilinx drivers fixes for initialize registers, missing dependencies
           and limiting descriptor IDs
      
         - mediatek descriptor management fixes"
      
      * tag 'dmaengine-fix-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/dmaengine:
        dmaengine: mediatek: use GFP_NOWAIT instead of GFP_ATOMIC in prep_dma
        dmaengine: mediatek: do not issue a new desc if one is still current
        dmaengine: mediatek: free the proper desc in desc_free handler
        dmaengine: ipu: fix doc warning in ipu_irq.c
        dmaengine: rcar-dmac: Fix PM reference leak in rcar_dmac_probe()
        dmaengine: idxd: Fix missing error code in idxd_cdev_open()
        dmaengine: stedma40: add missing iounmap() on error in d40_probe()
        dmaengine: SF_PDMA depends on HAS_IOMEM
        dmaengine: QCOM_HIDMA_MGMT depends on HAS_IOMEM
        dmaengine: ALTERA_MSGDMA depends on HAS_IOMEM
        dmaengine: idxd: Add missing cleanup for early error out in probe call
        dmaengine: xilinx: dpdma: Limit descriptor IDs to 16 bits
        dmaengine: xilinx: dpdma: Add missing dependencies to Kconfig
        dmaengine: stm32-mdma: fix PM reference leak in stm32_mdma_alloc_chan_resourc()
        dmaengine: zynqmp_dma: Fix PM reference leak in zynqmp_dma_alloc_chan_resourc()
        dmaengine: xilinx: dpdma: initialize registers before request_irq
        dmaengine: pl330: fix wrong usage of spinlock flags in dma_cyclc
        dmaengine: fsl-dpaa2-qdma: Fix error return code in two functions
        dmaengine: idxd: add missing dsa driver unregister
        dmaengine: idxd: add engine 'struct device' missing bus type assignment
      6b00bc63
  4. Jun 16, 2021