  1. Oct 24, 2017
  2. Oct 17, 2017
    • xfs: move two more RT specific functions into CONFIG_XFS_RT · 785545c8
      Arnd Bergmann authored
      The last cleanup introduced two harmless warnings:
      
      fs/xfs/xfs_fsmap.c:480:1: warning: '__xfs_getfsmap_rtdev' defined but not used
      fs/xfs/xfs_fsmap.c:372:1: warning: 'xfs_getfsmap_rtdev_rtbitmap_helper' defined but not used
      
      This moves those two functions as well.
      
      Fixes: bb9c2e54 ("xfs: move more RT specific code under CONFIG_XFS_RT")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      785545c8
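The "defined but not used" warnings above appear when a static helper stays visible while its only caller is compiled out. A minimal sketch of the pattern follows. It is not the XFS code: `CONFIG_DEMO_RT`, `demo_rt_helper` and `demo_query_device` are hypothetical names chosen for illustration.

```c
#include <assert.h>

/*
 * Hypothetical switch standing in for CONFIG_XFS_RT. Leave it undefined to
 * build without the "realtime" path, or define it to build with that path.
 */
/* #define CONFIG_DEMO_RT 1 */

#ifdef CONFIG_DEMO_RT
/*
 * Only the guarded branch below calls this helper, so it sits under the
 * same guard. Left outside, it would be "defined but not used" whenever
 * CONFIG_DEMO_RT is off.
 */
static int demo_rt_helper(int blocks)
{
	return blocks * 2;
}
#endif

/* Returns a made-up block count; device 1 plays the role of the RT device. */
int demo_query_device(int dev, int blocks)
{
	assert(blocks >= 0);
#ifdef CONFIG_DEMO_RT
	if (dev == 1)
		return demo_rt_helper(blocks);
#endif
	(void)dev;
	return blocks;
}
```

With the switch left undefined, both devices take the generic path and the build stays free of unused-function warnings.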
    • xfs: trim writepage mapping to within eof · 40214d12
      Brian Foster authored
      The writeback rework in commit fbcc0256 ("xfs: Introduce writeback
      context for writepages") introduced a subtle change in
      behavior with regard to the block mapping used across the
      ->writepages() sequence. The previous xfs_cluster_write() code would
      only flush pages up to EOF at the time of the writepage, thus
      ensuring that any pages due to file-extending writes would be
      handled on a separate cycle and with a new, updated block mapping.
      
      The updated code establishes a block mapping in xfs_writepage_map()
      that could extend beyond EOF if the file has post-eof preallocation.
      Because we now use the generic writeback infrastructure and pass the
      cached mapping to each writepage call, there is no implicit EOF
      limit in place. If eofblocks trimming occurs during ->writepages(),
      any post-eof portion of the cached mapping becomes invalid. The
      eofblocks code has no means to serialize against writeback because
      there are no pages associated with post-eof blocks. Therefore if an
      eofblocks trim occurs and is followed by a file-extending buffered
      write, not only has the mapping become invalid, but we could end up
      writing a page to disk based on the invalid mapping.
      
      Consider the following sequence of events:
      
      - A buffered write creates a delalloc extent and post-eof
        speculative preallocation.
      - Writeback starts and on the first writepage cycle, the delalloc
        extent is converted to real blocks (including the post-eof blocks)
        and the mapping is cached.
      - The file is closed and xfs_release() trims post-eof blocks. The
        cached writeback mapping is now invalid.
      - Another buffered write appends the file with a delalloc extent.
      - The concurrent writeback cycle picks up the just written page
        because the writeback range end is LLONG_MAX. xfs_writepage_map()
        attributes it to the (now invalid) cached mapping and writes the
        data to an incorrect location on disk (and where the file offset is
        still backed by a delalloc extent).
      
      This problem is reproduced by xfstests test generic/464, which
      triggers racing writes, appends, open/closes and writeback requests.
      
      To address this problem, trim the mapping used during writeback to
      within EOF when the mapping is validated. This ensures the mapping
      is revalidated for any pages encountered beyond EOF as of the time
      the current mapping was cached or last validated.
      
      Reported-by: Eryu Guan <eguan@redhat.com>
      Diagnosed-by: Eryu Guan <eguan@redhat.com>
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      40214d12
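The idea of trimming a cached mapping to within EOF can be illustrated with a small user-space model. This is a sketch under stated assumptions, not the XFS implementation: `struct demo_map`, `demo_trim_to_eof` and `demo_map_covers` are hypothetical, and offsets are counted in whole file blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a cached block mapping, in units of file blocks. */
struct demo_map {
	uint64_t start; /* first file block covered by the mapping */
	uint64_t count; /* number of blocks covered */
};

/*
 * Trim the mapping so it never extends past eof_block, the first block at
 * or beyond EOF. A mapping that starts at or past EOF becomes empty.
 */
void demo_trim_to_eof(struct demo_map *map, uint64_t eof_block)
{
	assert(map != NULL);
	if (map->start >= eof_block)
		map->count = 0;
	else if (map->count > eof_block - map->start)
		map->count = eof_block - map->start;
}

/* A block is covered only if it lies inside the (possibly trimmed) mapping. */
int demo_map_covers(const struct demo_map *map, uint64_t block)
{
	assert(map != NULL);
	return block >= map->start && block - map->start < map->count;
}
```

In this model a page that lands at or beyond the EOF seen at validation time is no longer covered, so the caller has to look up a fresh mapping instead of trusting the cached one.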
    • fs: invalidate page cache after end_io() in dio completion · 5e25c269
      Eryu Guan authored
      Commit 332391a9 ("fs: Fix page cache inconsistency when mixing
      buffered and AIO DIO") moved page cache invalidation from
      iomap_dio_rw() to iomap_dio_complete() for iomap based direct write
      path, but before the dio->end_io() call, and it re-introduced the bug
      fixed by commit c771c14b ("iomap: invalidate page caches should
      be after iomap_dio_complete() in direct write").
      
      I found this because fstests generic/418 started failing on XFS with
      v4.14-rc3 kernel, which is the regression test for this specific
      bug.
      
      So similarly, fix it by moving dio->end_io() (which does the
      unwritten extent conversion) before page cache invalidation, to make
      sure the next buffered read sees the final real allocations, not
      unwritten extents. I also add some comments about why end_io() should
      go first in case we get it wrong again in the future.
      
      Note that there's no such problem in the non-iomap based direct
      write path, because we didn't remove the page cache invalidation
      after the ->direct_IO() call in generic_file_direct_write(), but I
      decided to fix dio_complete() too so we don't leave a landmine
      there, and to be consistent with iomap_dio_complete().
      
      Fixes: 332391a9 ("fs: Fix page cache inconsistency when mixing buffered and AIO DIO")
      Signed-off-by: Eryu Guan <eguan@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Lukas Czerner <lczerner@redhat.com>
      5e25c269
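Why the ordering matters can be shown with a toy model of one file block. This is only a sketch: `struct demo_file` and the `demo_*` functions are hypothetical stand-ins, not kernel APIs, and the "unwritten extents read as zeroes" rule is reduced to a single flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model of one file block; none of these names are kernel APIs. */
struct demo_file {
	int disk_data;        /* data the direct write put on disk */
	int extent_unwritten; /* 1 until end_io converts the extent */
	int cache_valid;      /* 1 if the page cache holds a copy */
	int cached_data;      /* contents of that cached copy */
};

/* Stand-in for dio->end_io(): convert the unwritten extent to a real one. */
void demo_end_io(struct demo_file *f)
{
	assert(f != NULL);
	f->extent_unwritten = 0;
}

/* Stand-in for page cache invalidation: drop the cached copy. */
void demo_invalidate(struct demo_file *f)
{
	assert(f != NULL);
	f->cache_valid = 0;
}

/* Buffered read: refill the cache on a miss; unwritten extents read as zeroes. */
int demo_buffered_read(struct demo_file *f)
{
	assert(f != NULL);
	if (!f->cache_valid) {
		f->cached_data = f->extent_unwritten ? 0 : f->disk_data;
		f->cache_valid = 1;
	}
	return f->cached_data;
}

/* Completion in the order the commit describes: end_io first, then invalidate. */
void demo_dio_complete(struct demo_file *f)
{
	demo_end_io(f);
	demo_invalidate(f);
}
```

If the two steps are swapped and a buffered read slips in between them, the model caches zeroes from the still-unwritten extent and keeps returning them after the conversion, which is the stale-data symptom the commit message describes.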
    • xfs: cancel dirty pages on invalidation · 793d7dbe
      Dave Chinner authored
      
      
      Recently we've had warnings arise from the vm handing us pages
      without bufferheads attached to them. This should not ever occur
      in XFS, but we don't defend against it properly if it does. The only
      place where we remove bufferheads from a page is in
      xfs_vm_releasepage(), but we can't tell the difference here between
      "page is dirty so don't release" and "page is dirty but is being
      invalidated so release it".
      
      Some places that invalidate pages ask for pages to be released
      and, after calling ->releasepage, check whether the page was
      dirty and then abort the invalidation. This
      is a possible vector for releasing buffers from a page but then
      leaving it in the mapping, so we really do need to avoid dirty pages
      in xfs_vm_releasepage().
      
      To differentiate between invalidated pages and normal pages, we need
      to clear the page dirty flag when invalidating the pages. This can
      be done through xfs_vm_invalidatepage(), and will result in
      xfs_vm_releasepage() seeing the page as clean, which matches the
      bufferhead state on the page after calling block_invalidatepage().
      
      Hence we can re-add the page dirty check in xfs_vm_releasepage to
      catch the case where we might be releasing a page that is actually
      dirty and so should not have the bufferheads on it removed. This
      will remove one possible vector of "dirty page with no bufferheads"
      and so help narrow down the search for the root cause of that
      problem.
      
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      793d7dbe
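The interplay between invalidation and release can be sketched with two flags on a model page. This is an illustration only: `struct demo_page`, `demo_invalidatepage` and `demo_releasepage` are hypothetical, and the rule that only a whole-page invalidation cancels the dirty state is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u

/* Hypothetical model page: just the two bits the commit message talks about. */
struct demo_page {
	bool dirty;
	bool has_buffers;
};

/*
 * Invalidating the whole page cancels its dirty state (an assumption of
 * this sketch); a partial invalidation leaves the flag alone.
 */
void demo_invalidatepage(struct demo_page *page, unsigned int offset,
			 unsigned int length)
{
	assert(page != NULL);
	if (offset == 0 && length >= DEMO_PAGE_SIZE)
		page->dirty = false;
}

/* Release refuses dirty pages, so buffers are never stripped from live data. */
bool demo_releasepage(struct demo_page *page)
{
	assert(page != NULL);
	if (page->dirty)
		return false;
	page->has_buffers = false;
	return true;
}
```

A dirty page that was never invalidated keeps its buffers, while an invalidated page looks clean by the time release runs, so the dirty check can tell the two cases apart.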
  3. Oct 16, 2017
  4. Oct 15, 2017
  5. Oct 14, 2017
    • x86/microcode: Do the family check first · 1f161f67
      Borislav Petkov authored
      
      
      On CPUs like AMD's Geode, for example, we shouldn't even try to load
      microcode because they do not support the modern microcode loading
      interface.
      
      However, we do the family check *after* the other checks for whether
      the loader has been disabled on the command line or whether we're
      running in a guest.
      
      So move the family checks first in order to exit early if we're being
      loaded on an unsupported family.
      
      Reported-and-tested-by: Sven Glodowski <glodi1@arcor.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: <stable@vger.kernel.org> # 4.11..
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://bugzilla.suse.com/show_bug.cgi?id=1061396
      Link: http://lkml.kernel.org/r/20171012112316.977-1-bp@alien8.de
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1f161f67
    • locking/lockdep: Disable cross-release features for now · b483cf3b
      Ingo Molnar authored
      Johan Hovold reported a big boot slowdown on his system, caused by lockdep:
      
      > I had noticed that the BeagleBone Black boot time appeared to have
      > increased significantly with 4.14 and yesterday I finally had time to
      > investigate it.
      >
      > Boot time (from "Linux version" to login prompt) had in fact doubled
      > since 4.13 where it took 17 seconds (with my current config) compared to
      > the 35 seconds I now see with 4.14-rc4.
      >
      > A quick bisect pointed to lockdep and specifically the following commit:
      >
      >	28a903f6 ("locking/lockdep: Handle non(or multi)-acquisition of a crosslock")
      
      Because the final v4.14 release is close, disable the cross-release lockdep
      features for now.
      
      Bisected-by: Johan Hovold <johan@kernel.org>
      Debugged-by: Johan Hovold <johan@kernel.org>
      Reported-by: Johan Hovold <johan@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: kernel-team@lge.com
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-mm@kvack.org
      Cc: linux-omap@vger.kernel.org
      Link: http://lkml.kernel.org/r/20171014072659.f2yr6mhm5ha3eou7@gmail.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b483cf3b
    • Merge branch '4.14-fixes' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus · be1f16ba
      Linus Torvalds authored
      Pull MIPS fixes from Ralf Baechle:
       "More MIPS fixes for 4.14:
      
         - Loongson 1: Set the default number of RX and TX queues to
           accommodate recent changes to the stmmac driver.
      
         - BPF: Fix uninitialised target compiler error.
      
         - Fix cmpxchg on 32 bit signed ints for 64 bit kernels with
           !kernel_uses_llsc
      
         - Fix generic-board-config.sh for builds using O=
      
         - Remove pr_err() calls from fpu_emu() for a case which is not a
           kernel error"
      
      * '4.14-fixes' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
        MIPS: math-emu: Remove pr_err() calls from fpu_emu()
        MIPS: Fix generic-board-config.sh for builds using O=
        MIPS: Fix cmpxchg on 32b signed ints for 64b kernel with !kernel_uses_llsc
        MIPS: loongson1: set default number of rx and tx queues for stmmac
        MIPS: bpf: Fix uninitialised target compiler error
      be1f16ba
    • x86/mm: Flush more aggressively in lazy TLB mode · b956575b
      Andy Lutomirski authored
      Since commit:
      
        94b1b03b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
      
      x86's lazy TLB mode has been all the way lazy: when running a kernel thread
      (including the idle thread), the kernel keeps using the last user mm's
      page tables without attempting to maintain user TLB coherence at all.
      
      From a pure semantic perspective, this is fine -- kernel threads won't
      attempt to access user pages, so having stale TLB entries doesn't matter.
      
      Unfortunately, I forgot about a subtlety.  By skipping TLB flushes,
      we also allow any paging-structure caches that may exist on the CPU
      to become incoherent.  This means that we can have a
      paging-structure cache entry that references a freed page table, and
      the CPU is within its rights to do a speculative page walk starting
      at the freed page table.
      
      I can imagine this causing two different problems:
      
       - A speculative page walk starting from a bogus page table could read
         IO addresses.  I haven't seen any reports of this causing problems.
      
       - A speculative page walk that involves a bogus page table can install
         garbage in the TLB.  Such garbage would always be at a user VA, but
         some AMD CPUs have logic that triggers a machine check when it notices
         these bogus entries.  I've seen a couple reports of this.
      
      Boris further explains the failure mode:
      
      > It is actually more of an optimization which assumes that paging-structure
      > entries are in WB DRAM:
      >
      > "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
      > performance optimization that assumes PML4, PDP, PDE, and PTE entries
      > are in cacheable WB-DRAM; memory type checks may be bypassed, and
      > addresses outside of WB-DRAM may result in undefined behavior or NB
      > protocol errors. 1=Disables performance optimization and allows PML4,
      > PDP, PDE and PTE entries to be in any memory type. Operating systems
      > that maintain page tables in memory types other than WB- DRAM must set
      > TlbCacheDis to insure proper operation."
      >
      > The MCE generated is an NB protocol error to signal that
      >
      > "Link: A specific coherent-only packet from a CPU was issued to an
      > IO link. This may be caused by software which addresses page table
      > structures in a memory type other than cacheable WB-DRAM without
      > properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
      > example, when page table structure addresses are above top of memory. In
      > such cases, the NB will generate an MCE if it sees a mismatch between
      > the memory operation generated by the core and the link type."
      >
      > I'm assuming coherent-only packets don't go out on IO links, thus the
      > error.
      
      To fix this, reinstate TLB coherence in lazy mode.  With this patch
      applied, we do it in one of two ways:
      
       - If we have PCID, we simply switch back to init_mm's page tables
         when we enter a kernel thread -- this seems to be quite cheap
         except for the cost of serializing the CPU.
      
       - If we don't have PCID, then we set a flag and switch to init_mm
         the first time we would otherwise need to flush the TLB.
      
      The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed
      to override the default mode for benchmarking.
      
      In theory, we could optimize this better by only flushing the TLB in
      lazy CPUs when a page table is freed.  Doing that would require
      auditing the mm code to make sure that all page table freeing goes
      through tlb_remove_page() as well as reworking some data structures
      to implement the improved flush logic.
      
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Reported-by: Adam Borowski <kilobyte@angband.pl>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Johannes Hirte <johannes.hirte@datenkhaos.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Roman Kagan <rkagan@virtuozzo.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 94b1b03b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
      Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnic
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b956575b
    • Merge tag 'drm-fixes-for-v4.14-rc5' of git://people.freedesktop.org/~airlied/linux · 9aa0d2dd
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Couple of the arm people seem to wake up so this has imx and msm
        fixes, along with a bunch of i915 stable bounds fixes and an amdgpu
        regression fix.
      
        All seems pretty okay for now"
      
      * tag 'drm-fixes-for-v4.14-rc5' of git://people.freedesktop.org/~airlied/linux:
        drm/msm: fix _NO_IMPLICIT fencing case
        drm/msm: fix error path cleanup
        drm/msm/mdp5: Remove extra pm_runtime_put call in mdp5_crtc_cursor_set()
        drm/msm/dsi: Use correct pm_runtime_put variant during host_init
        drm/msm: fix return value check in _msm_gem_kernel_new()
        drm/msm: use proper memory barriers for updating tail/head
        drm/msm/mdp5: add missing max size for 8x74 v1
        drm/amdgpu: fix placement flags in amdgpu_ttm_bind
        drm/i915/bios: parse DDI ports also for CHV for HDMI DDC pin and DP AUX channel
        gpu: ipu-v3: pre: implement workaround for ERR009624
        gpu: ipu-v3: prg: wait for double buffers to be filled on channel startup
        gpu: ipu-v3: Allow channel burst locking on i.MX6 only
        drm/i915: Read timings from the correct transcoder in intel_crtc_mode_get()
        drm/i915: Order two completing nop_submit_request
        drm/i915: Silence compiler warning for hsw_power_well_enable()
        drm/i915: Use crtc_state_is_legacy_gamma in intel_color_check
        drm/i915/edp: Increase the T12 delay quirk to 1300ms
        drm/i915/edp: Get the Panel Power Off timestamp after panel is off
        sync_file: Return consistent status in SYNC_IOC_FILE_INFO
        drm/atomic: Unref duplicated drm_atomic_state in drm_atomic_helper_resume()
      9aa0d2dd
    • Merge tag 'drm-intel-fixes-2017-10-11' of... · a480f308
      Dave Airlie authored
      Merge tag 'drm-intel-fixes-2017-10-11' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
      
      drm/i915 fixes for 4.14-rc5:
      
      Three fixes for stable:
      
      - Use crtc_state_is_legacy_gamma in intel_color_check (Maarten)
      - Read timings from the correct transcoder (Ville).
      - Fix HDMI on BSW (Jani).
      
      Other fixes:
      
      - eDP fixes (Manasi)
      - Silence compiler warnings (Chris)
      - Order two completing nop_submit_request (Chris)
      
      * tag 'drm-intel-fixes-2017-10-11' of git://anongit.freedesktop.org/drm/drm-intel:
        drm/i915/bios: parse DDI ports also for CHV for HDMI DDC pin and DP AUX channel
        drm/i915: Read timings from the correct transcoder in intel_crtc_mode_get()
        drm/i915: Order two completing nop_submit_request
        drm/i915: Silence compiler warning for hsw_power_well_enable()
        drm/i915: Use crtc_state_is_legacy_gamma in intel_color_check
        drm/i915/edp: Increase the T12 delay quirk to 1300ms
        drm/i915/edp: Get the Panel Power Off timestamp after panel is off
      a480f308
    • Merge branch 'msm-fixes-4.14-rc4' of git://people.freedesktop.org/~robclark/linux into drm-fixes · 7a5bea77
      Dave Airlie authored
      bunch of msm fixes
      
      * 'msm-fixes-4.14-rc4' of git://people.freedesktop.org/~robclark/linux:
        drm/msm: fix _NO_IMPLICIT fencing case
        drm/msm: fix error path cleanup
        drm/msm/mdp5: Remove extra pm_runtime_put call in mdp5_crtc_cursor_set()
        drm/msm/dsi: Use correct pm_runtime_put variant during host_init
        drm/msm: fix return value check in _msm_gem_kernel_new()
        drm/msm: use proper memory barriers for updating tail/head
        drm/msm/mdp5: add missing max size for 8x74 v1
      7a5bea77
    • Merge branch 'akpm' (patches from Andrew) · 06d97c58
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "18 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm, swap: use page-cluster as max window of VMA based swap readahead
        mm: page_vma_mapped: ensure pmd is loaded with READ_ONCE outside of lock
        kmemleak: clear stale pointers from task stacks
        fs/binfmt_misc.c: node could be NULL when evicting inode
        fs/mpage.c: fix mpage_writepage() for pages with buffers
        linux/kernel.h: add/correct kernel-doc notation
        tty: fall back to N_NULL if switching to N_TTY fails during hangup
        Revert "vmalloc: back off when the current task is killed"
        mm/cma.c: take __GFP_NOWARN into account in cma_alloc()
        scripts/kallsyms.c: ignore symbol type 'n'
        userfaultfd: selftest: exercise -EEXIST only in background transfer
        mm: only display online cpus of the numa node
        mm: remove unnecessary WARN_ONCE in page_vma_mapped_walk().
        mm/mempolicy: fix NUMA_INTERLEAVE_HIT counter
        include/linux/of.h: provide of_n_{addr,size}_cells wrappers for !CONFIG_OF
        mm/madvise.c: add description for MADV_WIPEONFORK and MADV_KEEPONFORK
        lib/Kconfig.debug: kernel hacking menu: runtime testing: keep tests together
        mm/migrate: fix indexing bug (off by one) and avoid out of bound access
      06d97c58
    • mm, swap: use page-cluster as max window of VMA based swap readahead · 61b63972
      Huang Ying authored
      When the VMA based swap readahead was introduced, a new knob
      
        /sys/kernel/mm/swap/vma_ra_max_order
      
      was added as the max window of VMA swap readahead.  This is to make it
      possible to use different max window for VMA based readahead and
      original physical readahead.  But Minchan Kim pointed out that this will
      cause a regression because setting page-cluster sysctl to zero cannot
      disable swap readahead with the change.
      
      To fix the regression, the page-cluster sysctl is used as the max window
      of both the VMA based swap readahead and original physical swap
      readahead.  If more fine grained control is needed in the future, more
      knobs can be added as the subordinate knobs of the page-cluster sysctl.
      
      The vma_ra_max_order knob is deleted.  Because the knob was introduced
      in v4.14-rc1, and this patch is targeted for merging before the v4.14
      release, there should be no existing users of this newly added ABI.
      
      Link: http://lkml.kernel.org/r/20171011070847.16003-1-ying.huang@intel.com
      Fixes: ec560175 ("mm, swap: VMA based swap readahead")
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reported-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      61b63972
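The page-cluster sysctl is a base-2 logarithm, which is why a value of zero disables readahead: the window shrinks to a single page. A minimal sketch of that arithmetic follows. The function names and the ceiling of 5 are assumptions made for illustration, not values taken from the commit.

```c
#include <assert.h>

/* Hypothetical upper bound on the readahead order; an assumption of this sketch. */
#define DEMO_RA_ORDER_CEILING 5u

/* Maximum readahead window in pages for a given page-cluster value. */
unsigned int demo_max_window(unsigned int page_cluster)
{
	unsigned int order = page_cluster < DEMO_RA_ORDER_CEILING
				? page_cluster : DEMO_RA_ORDER_CEILING;

	assert(order < 32); /* keep the shift well defined */
	return 1u << order;
}

/* A window of one page means only the faulting page is read: no readahead. */
int demo_readahead_enabled(unsigned int page_cluster)
{
	return demo_max_window(page_cluster) > 1;
}
```

Because both the VMA based and the physical readahead paths would derive their limit from the same value in this model, setting it to zero switches both off.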
    • mm: page_vma_mapped: ensure pmd is loaded with READ_ONCE outside of lock · a7b10095
      Will Deacon authored
      Loading the pmd without holding the pmd_lock exposes us to races with
      concurrent updaters of the page tables but, worse still, it also allows
      the compiler to cache the pmd value in a register and reuse it later on,
      even if we've performed a READ_ONCE in between and seen a more recent
      value.
      
      In the case of page_vma_mapped_walk, this leads to the following crash
      when the pmd loaded for the initial pmd_trans_huge check is all zeroes
      and a subsequent valid table entry is loaded by check_pmd.  We then
      proceed into map_pte, but the compiler re-uses the zero entry inside
      pte_offset_map, resulting in a junk pointer being installed in
      pvmw->pte:
      
        PC is at check_pte+0x20/0x170
        LR is at page_vma_mapped_walk+0x2e0/0x540
        [...]
        Process doio (pid: 2463, stack limit = 0xffff00000f2e8000)
        Call trace:
          check_pte+0x20/0x170
          page_vma_mapped_walk+0x2e0/0x540
          page_mkclean_one+0xac/0x278
          rmap_walk_file+0xf0/0x238
          rmap_walk+0x64/0xa0
          page_mkclean+0x90/0xa8
          clear_page_dirty_for_io+0x84/0x2a8
          mpage_submit_page+0x34/0x98
          mpage_process_page_bufs+0x164/0x170
          mpage_prepare_extent_to_map+0x134/0x2b8
          ext4_writepages+0x484/0xe30
          do_writepages+0x44/0xe8
          __filemap_fdatawrite_range+0xbc/0x110
          file_write_and_wait_range+0x48/0xd8
          ext4_sync_file+0x80/0x4b8
          vfs_fsync_range+0x64/0xc0
          SyS_msync+0x194/0x1e8
      
      This patch fixes the problem by ensuring that READ_ONCE is used before
      the initial checks on the pmd, and this value is subsequently used when
      checking whether or not the pmd is present.  check_pmd is removed and
      the pmd_present check is inlined directly.
      
      Link: http://lkml.kernel.org/r/1507222630-5839-1-git-send-email-will.deacon@arm.com
      Fixes: f27176cf ("mm: convert page_mkclean_one() to use page_vma_mapped_walk()")
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Tested-by: Yury Norov <ynorov@caviumnetworks.com>
      Tested-by: Richard Ruigrok <rruigrok@codeaurora.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a7b10095
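The fix boils down to loading the entry once and deciding everything from that snapshot. The sketch below shows the pattern in user space. `DEMO_READ_ONCE` is a stand-in for the kernel macro and relies on the GNU `__typeof__` extension; the flag bits, `demo_pmd_t` and `demo_classify_pmd` are hypothetical.

```c
#include <assert.h>

/* User-space stand-in for READ_ONCE(): force a single volatile load. */
#define DEMO_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

typedef unsigned long demo_pmd_t;

/* Made-up flag bits for the sketch; real page table formats differ. */
#define DEMO_PMD_PRESENT 0x1UL
#define DEMO_PMD_HUGE    0x2UL

enum demo_pmd_kind { DEMO_PMD_NONE, DEMO_PMD_IS_HUGE, DEMO_PMD_TABLE };

/*
 * Load the entry exactly once and base every later check on that snapshot,
 * so an older and a newer value of the entry are never mixed.
 */
enum demo_pmd_kind demo_classify_pmd(demo_pmd_t *pmdp)
{
	demo_pmd_t pmde;

	assert(pmdp != 0);
	pmde = DEMO_READ_ONCE(*pmdp);
	if (!(pmde & DEMO_PMD_PRESENT))
		return DEMO_PMD_NONE;
	if (pmde & DEMO_PMD_HUGE)
		return DEMO_PMD_IS_HUGE;
	return DEMO_PMD_TABLE;
}
```

Reading `*pmdp` directly in each condition would let the compiler reload or reuse the value at will, which is the hazard the commit message describes.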
    • kmemleak: clear stale pointers from task stacks · ca182551
      Konstantin Khlebnikov authored
      Kmemleak considers any pointers on task stacks as references.  This
      patch clears newly allocated and reused vmap stacks.
      
      Link: http://lkml.kernel.org/r/150728990124.744199.8403409836394318684.stgit@buzz
      
      
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ca182551
    • fs/binfmt_misc.c: node could be NULL when evicting inode · 7e866006
      Eryu Guan authored
      inode->i_private is assigned by a Node pointer only after registering a
      new binary format, so it could be NULL if inode was created by
      bm_fill_super() (or iput() was called by the error path in
      bm_register_write()), and this could result in NULL pointer dereference
      when evicting such an inode.  e.g.  mount binfmt_misc filesystem then
      umount it immediately:
      
        mount -t binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc
        umount /proc/sys/fs/binfmt_misc
      
      will result in
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000013
        IP: bm_evict_inode+0x16/0x40 [binfmt_misc]
        ...
        Call Trace:
         evict+0xd3/0x1a0
         iput+0x17d/0x1d0
         dentry_unlink_inode+0xb9/0xf0
         __dentry_kill+0xc7/0x170
         shrink_dentry_list+0x122/0x280
         shrink_dcache_parent+0x39/0x90
         do_one_tree+0x12/0x40
         shrink_dcache_for_umount+0x2d/0x90
         generic_shutdown_super+0x1f/0x120
         kill_litter_super+0x29/0x40
         deactivate_locked_super+0x43/0x70
         deactivate_super+0x45/0x60
         cleanup_mnt+0x3f/0x70
         __cleanup_mnt+0x12/0x20
         task_work_run+0x86/0xa0
         exit_to_usermode_loop+0x6d/0x99
         syscall_return_slowpath+0xba/0xf0
         entry_SYSCALL_64_fastpath+0xa3/0xa
      
      Fix it by making sure Node (e) is not NULL.
      
      Link: http://lkml.kernel.org/r/20171010100642.31786-1-eguan@redhat.com
      Fixes: 83f91827 ("exec: binfmt_misc: shift filp_close(interp_file) from kill_node() to bm_evict_inode()")
      Signed-off-by: Eryu Guan <eguan@redhat.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e866006
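The shape of the fix is a NULL guard before touching the per-inode node. A minimal user-space sketch is given below. `struct demo_node`, `struct demo_inode` and `demo_evict_inode` are hypothetical stand-ins, not the binfmt_misc code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a registered binary format entry. */
struct demo_node {
	int interp_open; /* 1 while the interpreter file is held open */
};

/* Hypothetical stand-in for an inode; i_private is set only after registration. */
struct demo_inode {
	void *i_private;
	int cleared;
};

/* Returns 1 if a node was torn down, 0 if the inode never had one. */
int demo_evict_inode(struct demo_inode *inode)
{
	struct demo_node *e;

	assert(inode != NULL);
	e = inode->i_private;
	inode->cleared = 1;

	/* Inodes created at mount time carry no node, so check before use. */
	if (e == NULL)
		return 0;

	e->interp_open = 0;
	return 1;
}
```

In this model the mount-then-umount sequence from the commit message evicts an inode whose `i_private` is NULL and returns without dereferencing it.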
    • fs/mpage.c: fix mpage_writepage() for pages with buffers · f892760a
      Matthew Wilcox authored
      When using FAT on a block device which supports rw_page, we can hit
      BUG_ON(!PageLocked(page)) in try_to_free_buffers().  This is because we
      call clean_buffers() after unlocking the page we've written.  Introduce
      a new clean_page_buffers() which cleans all buffers associated with a
      page and call it from within bdev_write_page().
      
      [akpm@linux-foundation.org: s/PAGE_SIZE/~0U/ per Linus and Matthew]
      Link: http://lkml.kernel.org/r/20171006211541.GA7409@bombadil.infradead.org
      
      
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Reported-by: Toshi Kani <toshi.kani@hpe.com>
      Reported-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Tested-by: Toshi Kani <toshi.kani@hpe.com>
      Acked-by: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f892760a
    • linux/kernel.h: add/correct kernel-doc notation · e8c97af0
      Randy Dunlap authored
      Add kernel-doc notation for some macros.  Correct kernel-doc comments &
      typos for a few macros.
      
      Link: http://lkml.kernel.org/r/76fa1403-1511-be4c-e9c4-456b43edfad3@infradead.org
      
      
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e8c97af0
    • Johannes Weiner's avatar
      tty: fall back to N_NULL if switching to N_TTY fails during hangup · e65c62b1
      Johannes Weiner authored
      We have seen NULL-pointer dereference crashes in tty->disc_data when the
      N_TTY fallback driver failed to open during hangup.  The immediate cause
      of this open failure has been addressed in the preceding patch to
      vmalloc(), but this code could be more robust.
      
      As Alan pointed out in commit 8a8dabf2 ("tty: handle the case where
      we cannot restore a line discipline"), the N_TTY driver, historically
      the safe fallback that could never fail, can indeed fail, but the
      surrounding code is not prepared to handle this.  To avoid crashes he
      added a new N_NULL driver to take N_TTY's place as the last resort.
      
      Hook that fallback up to the hangup path.  Update tty_ldisc_reinit() to
      reflect the reality that n_tty_open can indeed fail.
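      
      The fallback order described above can be modelled in user space.  This
      is only a rough sketch, not the kernel code: the names prefixed with
      sketch_ and SKETCH_ are hypothetical, the open callback stands in for
      opening a line discipline, and the numeric values are illustrative.

```c
/* Illustrative line discipline numbers for this sketch. */
enum { SKETCH_N_TTY = 0, SKETCH_N_NULL = 27 };

/* Hypothetical open callback: returns 0 on success, negative on failure. */
typedef int (*sketch_ldisc_open_fn)(int ldisc);

/*
 * Try the wanted discipline first, then N_TTY, then N_NULL as last resort.
 * Returns the discipline that ended up installed, or -1 if all of them
 * failed to open.
 */
static int sketch_hangup_reinit(int wanted, sketch_ldisc_open_fn open_ldisc)
{
	if (open_ldisc(wanted) == 0)
		return wanted;
	if (wanted != SKETCH_N_TTY && open_ldisc(SKETCH_N_TTY) == 0)
		return SKETCH_N_TTY;
	if (open_ldisc(SKETCH_N_NULL) == 0)
		return SKETCH_N_NULL;
	return -1;
}

/* Example callbacks used to exercise the fallback chain. */
static int sketch_open_always_ok(int ldisc)
{
	(void)ldisc;
	return 0;
}

static int sketch_open_only_null(int ldisc)
{
	return ldisc == SKETCH_N_NULL ? 0 : -1;
}
```

      With sketch_open_only_null, every request ends up on the N_NULL
      stand-in instead of leaving the caller without a usable discipline.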
      
      Link: http://lkml.kernel.org/r/20171004185959.GC2136@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alan Cox <alan@llwyncelyn.cymru>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e65c62b1
    • Johannes Weiner's avatar
      Revert "vmalloc: back off when the current task is killed" · b8c8a338
      Johannes Weiner authored
      This reverts commits 5d17a73a ("vmalloc: back off when the current
      task is killed") and 171012f5 ("mm: don't warn when vmalloc() fails
      due to a fatal signal").
      
      Commit 5d17a73a ("vmalloc: back off when the current task is
      killed") made all vmalloc allocations from a signal-killed task fail.
      We have seen crashes in the tty driver from this, where a killed task
      exiting tries to switch back to N_TTY, fails n_tty_open because of the
      vmalloc failing, and later crashes when dereferencing tty->disc_data.
      
      Arguably, relying on a vmalloc() call to succeed in order to properly
      exit a task is not the most robust way of doing things.  There will be a
      follow-up patch to the tty code to fall back to the N_NULL ldisc.
      
      But the justification to make that vmalloc() call fail like this isn't
      convincing, either.  The patch mentions an OOM victim exhausting the
      memory reserves and thus deadlocking the machine.  But the OOM killer is
      only one, improbable source of fatal signals.  It doesn't make sense to
      fail allocations preemptively with plenty of memory in most cases.
      
      The patch doesn't mention real-life instances where vmalloc sites would
      exhaust memory, which makes it sound more like a theoretical issue to
      begin with.  But just in case, the OOM access to memory reserves has
      been restricted on the allocator side in cd04ae1e ("mm, oom: do not
      rely on TIF_MEMDIE for memory reserves access"), which should take care
      of any theoretical concerns on that front.
      
      Revert this patch, and the follow-up that suppresses the allocation
      warnings when we fail the allocations due to a signal.
      
      Link: http://lkml.kernel.org/r/20171004185906.GB2136@cmpxchg.org
      Fixes: 171012f5 ("mm: don't warn when vmalloc() fails due to a fatal signal")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alan Cox <alan@llwyncelyn.cymru>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b8c8a338
    • Boris Brezillon's avatar
      mm/cma.c: take __GFP_NOWARN into account in cma_alloc() · ef465014
      Boris Brezillon authored
      cma_alloc() unconditionally prints an INFO message when the CMA
      allocation fails.  Make this message conditional on the absence of
      __GFP_NOWARN in gfp_mask.
      
      This patch aims at removing INFO messages that are displayed when the
      VC4 driver tries to allocate buffer objects.  From the driver
      perspective an allocation failure is acceptable, and the driver can
      possibly do something to make following allocation succeed (like
      flushing the VC4 internal cache).
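      
      The idea of gating a failure message on a "no warn" flag can be
      sketched in plain C.  This is a user-space illustration only:
      SKETCH_GFP_NOWARN and sketch_report_alloc_failure() are hypothetical
      names, and the flag value is made up rather than taken from the kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a "do not warn" allocation flag. */
#define SKETCH_GFP_NOWARN 0x200u

/*
 * Print a failure message unless the caller asked for silence.
 * Returns true when a message was printed.
 */
static bool sketch_report_alloc_failure(unsigned int gfp_mask, size_t count,
					int ret)
{
	if (ret == 0)
		return false;	/* allocation succeeded, nothing to say */
	if (gfp_mask & SKETCH_GFP_NOWARN)
		return false;	/* caller treats failure as acceptable */
	fprintf(stderr, "alloc failed, req-size: %zu pages, ret: %d\n",
		count, ret);
	return true;
}
```

      A caller that can recover from a failed allocation, as the driver
      described above can, would pass the flag and retry quietly.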
      
      Link: http://lkml.kernel.org/r/20171004125447.15195-1-boris.brezillon@free-electrons.com
      Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com>
      Acked-by: Laura Abbott <labbott@redhat.com>
      Cc: Jaewon Kim <jaewon31.kim@samsung.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Eric Anholt <eric@anholt.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef465014
    • Guenter Roeck's avatar
      scripts/kallsyms.c: ignore symbol type 'n' · 51962a9d
      Guenter Roeck authored
      gcc on aarch64 may emit symbols of type 'n' if the kernel is built with
      '-frecord-gcc-switches'.  In most cases, those symbols are reported with
      nm as
      
      	000000000000000e n $d
      
      and with objdump as
      
      	0000000000000000 l    d  .GCC.command.line	0000000000000000 .GCC.command.line
      	000000000000000e l       .GCC.command.line	0000000000000000 $d
      
      Those symbols are detected in is_arm_mapping_symbol() and ignored.
      However, if "--prefix-symbols=<prefix>" is configured as well, the
      situation is different.  For example, in efi/libstub, arm64 images are
      built with
      
      	'--prefix-alloc-sections=.init --prefix-symbols=__efistub_'.
      
      In combination with '-frecord-gcc-switches', the symbols are now reported
      by nm as:
      
      	000000000000000e n __efistub_$d

      and by objdump as:

      	0000000000000000 l    d  .GCC.command.line	0000000000000000 .GCC.command.line
      	000000000000000e l       .GCC.command.line	0000000000000000 __efistub_$d
      
      Those symbols are no longer ignored and included in the base address
      calculation.  This results in a base address of 000000000000000e, which
      in turn causes kallsyms to abort with
      
          kallsyms failure:
      	relative symbol value 0xffffff900800a000 out of range in relative mode
      
      The problem is seen in little endian arm64 builds with CONFIG_EFI
      enabled and with '-frecord-gcc-switches' set in KCFLAGS.
      
      Explicitly ignore symbols of type 'n' since those are clearly debug
      symbols.
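      
      To see why one stray debug symbol is enough to break the base address,
      here is a small user-space sketch.  It assumes the base is the lowest
      address among the symbols that are kept; the struct, the function names
      and the example table are hypothetical, and only the two addresses are
      taken from the output quoted above.

```c
#include <stdbool.h>
#include <stddef.h>

/* One parsed line of nm output (hypothetical layout for this sketch). */
struct sketch_sym {
	unsigned long long addr;
	char type;		/* nm type letter */
	const char *name;
};

/* Treat nm type 'n' (and 'N') as debug symbols. */
static bool sketch_is_debug_sym(const struct sketch_sym *s)
{
	return s->type == 'n' || s->type == 'N';
}

/* Lowest address among the kept symbols, or 0 if none are kept. */
static unsigned long long sketch_base_address(const struct sketch_sym *syms,
					      size_t n, bool skip_debug)
{
	unsigned long long base = 0;
	bool found = false;

	for (size_t i = 0; i < n; i++) {
		if (skip_debug && sketch_is_debug_sym(&syms[i]))
			continue;
		if (!found || syms[i].addr < base) {
			base = syms[i].addr;
			found = true;
		}
	}
	return base;
}

/* Example table: a prefixed debug symbol plus one made-up text symbol. */
static const struct sketch_sym sketch_example[] = {
	{ 0x000000000000000eULL, 'n', "__efistub_$d" },
	{ 0xffffff900800a000ULL, 'T', "sketch_text_symbol" },
};
```

      Without the filter the base collapses to 0xe and the text symbol ends
      up far out of range; with the filter the base is the text address.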
      
      Link: http://lkml.kernel.org/r/1507136063-3139-1-git-send-email-linux@roeck-us.net
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51962a9d
    • Andrea Arcangeli's avatar
      userfaultfd: selftest: exercise -EEXIST only in background transfer · 7ddd8faf
      Andrea Arcangeli authored
      I was stress testing some backports and with high load, after some time,
      the latest version of the selftest showed some false positives in
      connection with the uffdio_copy_retry.  This seems to fix it while still
      exercising -EEXIST in the background transfer once in a while.
      
      The fork child will quit after the last UFFDIO_COPY is run, so a
      repeated UFFDIO_COPY may not return -EEXIST.  This change restricts the
      -EEXIST stress to the background transfer where the memory can't go away
      from under it.
      
      Also updated uffdio_zeropage, so the interface is consistent.
      
      Link: http://lkml.kernel.org/r/20171004171541.1495-2-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ddd8faf
    • Zhen Lei's avatar
      mm: only display online cpus of the numa node · 064f0e93
      Zhen Lei authored
      When I execute numactl -H (which reads /sys/devices/system/node/nodeX/cpumap
      and displays cpumask_of_node for each node), I get different results
      on x86 and arm64.  For each NUMA node, the former displays only online
      CPUs, while the latter displays all possible CPUs.  Unfortunately, neither
      the Linux documentation nor the numactl manual describes this clearly.
      
      I sent a mail to ask for help, and Michal Hocko replied that he
      preferred to print online cpus because it doesn't really make much sense
      to bind anything on offline nodes.
      
      Will said:
       "I suspect the vast majority (if not all) code that reads this file was
        developed for x86, so having the same behaviour for arm64 sounds like
        something we should do ASAP before people try to special case with
        things like #ifdef __aarch64__. I'd rather have this in 4.14 if
        possible."
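      
      Restricting the displayed mask to online CPUs amounts to intersecting
      two bitmasks.  A minimal sketch, assuming at most 64 CPUs so that a
      mask fits in one integer; sketch_node_cpumap() is a hypothetical helper
      and not the kernel's cpumask API.

```c
#include <stdint.h>

/*
 * Bit i set means CPU i.  Returns the CPUs that belong to the node and
 * are currently online.
 */
static uint64_t sketch_node_cpumap(uint64_t cpus_of_node, uint64_t online_cpus)
{
	return cpus_of_node & online_cpus;
}
```

      For a node that owns CPUs 4-7 (0xf0) on a system where CPUs 2-5 are
      online (0x3c), only CPUs 4 and 5 (0x30) would be shown.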
      
      Link: http://lkml.kernel.org/r/1506678805-15392-2-git-send-email-thunder.leizhen@huawei.com
      Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Tianhong Ding <dingtianhong@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Libin <huawei.libin@huawei.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      064f0e93
    • Zi Yan's avatar
      mm: remove unnecessary WARN_ONCE in page_vma_mapped_walk(). · af0db981
      Zi Yan authored
      A non-present pmd entry can appear after pmd_lock is taken in
      page_vma_mapped_walk(), even if THP migration is not enabled.  The
      WARN_ONCE is unnecessary.
      
      Link: http://lkml.kernel.org/r/20171003142606.12324-1-zi.yan@sent.com
      Fixes: 616b8371 ("mm: thp: enable thp migration in generic path")
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af0db981
    • Andrey Ryabinin's avatar
      mm/mempolicy: fix NUMA_INTERLEAVE_HIT counter · de55c8b2
      Andrey Ryabinin authored
      Commit 3a321d2a ("mm: change the call sites of numa statistics
      items") separated NUMA counters from zone counters, but the
      NUMA_INTERLEAVE_HIT call site wasn't updated to use the new interface.
      So alloc_page_interleave() actually increments NR_ZONE_INACTIVE_FILE
      instead of NUMA_INTERLEAVE_HIT.
      
      Fix this by using __inc_numa_state() interface to increment
      NUMA_INTERLEAVE_HIT.
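      
      The bug class is easy to reproduce in miniature: once two counter sets
      live in separate arrays, an index from one enum silently selects an
      unrelated slot in the other.  The sketch below uses hypothetical,
      shortened enums and helpers (all prefixed SK_ or sketch_); the layout
      is illustrative and not copied from the kernel headers.

```c
/* Hypothetical zone counters; the fourth entry has index 3. */
enum sketch_zone_item {
	SK_ZONE_FREE_PAGES,
	SK_ZONE_INACTIVE_ANON,
	SK_ZONE_ACTIVE_ANON,
	SK_ZONE_INACTIVE_FILE,
	SK_NR_ZONE_ITEMS
};

/* Hypothetical NUMA counters; the interleave hit also has index 3. */
enum sketch_numa_item {
	SK_NUMA_HIT,
	SK_NUMA_MISS,
	SK_NUMA_FOREIGN,
	SK_NUMA_INTERLEAVE_HIT,
	SK_NR_NUMA_ITEMS
};

struct sketch_zone {
	unsigned long zone_stat[SK_NR_ZONE_ITEMS];
	unsigned long numa_stat[SK_NR_NUMA_ITEMS];
};

/* Old-style interface: takes a plain index into the zone counters. */
static void sketch_inc_zone_state(struct sketch_zone *z, int item)
{
	z->zone_stat[item]++;
}

/* Interface for the separated NUMA counters. */
static void sketch_inc_numa_state(struct sketch_zone *z,
				  enum sketch_numa_item item)
{
	z->numa_stat[item]++;
}
```

      Calling sketch_inc_zone_state() with SK_NUMA_INTERLEAVE_HIT compiles
      without complaint and bumps the inactive-file slot, which mirrors the
      mix-up described above.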
      
      Link: http://lkml.kernel.org/r/20171003191003.8573-1-aryabinin@virtuozzo.com
      Fixes: 3a321d2a ("mm: change the call sites of numa statistics items")
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Kemi Wang <kemi.wang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de55c8b2
    • Arnd Bergmann's avatar
      include/linux/of.h: provide of_n_{addr,size}_cells wrappers for !CONFIG_OF · 8a1ac5dc
      Arnd Bergmann authored
      The pci-rcar driver is enabled for compile tests, and this has shown that
      the driver cannot build without CONFIG_OF, following the inclusion of
      commit f8f2fe73 ("PCI: rcar: Use new OF interrupt mapping when possible"):
      
        drivers/pci/host/pcie-rcar.c: In function 'pci_dma_range_parser_init':
        drivers/pci/host/pcie-rcar.c:1039:2: error: implicit declaration of function 'of_n_addr_cells' [-Werror=implicit-function-declaration]
          parser->pna = of_n_addr_cells(node);
          ^
      
      As pointed out by Ben Dooks and Geert Uytterhoeven, this is actually
      supposed to build fine, which we can achieve if we make the declaration
      of of_irq_parse_and_map_pci conditional on CONFIG_OF and provide an
      empty inline function otherwise, as we do for a lot of other OF
      interfaces.
      
      This lets us build the rcar_pci driver again without CONFIG_OF for build
      testing.  All platforms using this driver select OF, so this doesn't
      change anything for the users.
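      
      The general pattern of a config-dependent declaration with an inline
      stub looks roughly like this.  It is a sketch under assumptions:
      SKETCH_CONFIG_OF and the sketch_ function names are hypothetical
      stand-ins, and returning 0 from the stubs is an illustrative choice.

```c
#include <stddef.h>

struct device_node;	/* opaque here; callers only pass pointers around */

#ifdef SKETCH_CONFIG_OF
/* Real implementations would live in a separate source file. */
int sketch_of_n_addr_cells(struct device_node *np);
int sketch_of_n_size_cells(struct device_node *np);
#else
/* Stubs: callers still compile when the feature is configured out. */
static inline int sketch_of_n_addr_cells(struct device_node *np)
{
	(void)np;
	return 0;
}

static inline int sketch_of_n_size_cells(struct device_node *np)
{
	(void)np;
	return 0;
}
#endif
```

      Driver code can then call the helpers unconditionally, which is what
      makes compile testing without the option possible.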
      
      [akpm@linux-foundation.org: be consistent with surrounding code]
      Link: http://lkml.kernel.org/r/20170911200805.3363318-1-arnd@arndb.de
      Fixes: c25da477 ("PCI: rcar: Add Renesas R-Car PCIe driver")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Frank Rowand <frank.rowand@sony.com>
      Acked-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Magnus Damm <damm@opensource.se>
      Cc: Ben Dooks <ben.dooks@codethink.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a1ac5dc
    • Yang Shi's avatar
      mm/madvise.c: add description for MADV_WIPEONFORK and MADV_KEEPONFORK · c02c3009
      Yang Shi authored
      mm/madvise.c has a brief description of all MADV_ flags.  Add a
      description for the newly added MADV_WIPEONFORK and MADV_KEEPONFORK.
      
      Although the man page has similar information, it is better to keep
      this consistent with the other flags.
      
      Link: http://lkml.kernel.org/r/1506117328-88228-1-git-send-email-yang.s@alibaba-inc.com
      Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c02c3009