  1. Nov 12, 2018
    • sched/core: Create task_has_idle_policy() helper · 1da1843f
      Viresh Kumar authored
      
      
      We already have task_has_rt_policy() and task_has_dl_policy() helpers,
      create task_has_idle_policy() as well and update sched core to start
      using it.
      
      While at it, use task_has_dl_policy() at one more place.
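
      A minimal sketch of the shape of such a helper (assuming an existing
      idle_policy() predicate analogous to rt_policy()/dl_policy(); not
      necessarily the literal patch):

         static inline bool task_has_idle_policy(struct task_struct *p)
         {
                 return idle_policy(p->policy); /* i.e. p->policy == SCHED_IDLE */
         }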
      
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/ce3915d5b490fc81af926a3b6bfb775e7188e005.1541416894.git.viresh.kumar@linaro.org
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Add lsub_positive() and use it consistently · b5c0ce7b
      Patrick Bellasi authored
      
      
      The following pattern:
      
         var -= min_t(typeof(var), var, val);
      
      is used multiple times in fair.c.
      
      The existing sub_positive() already captures that pattern, but it also
      adds an explicit load-store to properly support lockless observations.
      In other cases the pattern above is used to update local, and/or not
      concurrently accessed, variables.
      
      Let's add a simpler version of sub_positive(), targeted at local variable
      updates, which gives the same readability benefits at the calling sites
      without enforcing {READ,WRITE}_ONCE() barriers.
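
      A minimal sketch of what such a local-variable variant can look like
      (illustrative, not necessarily the exact macro):

         /* Subtract without going below zero; plain load/store, no *_ONCE() */
         #define lsub_positive(_ptr, _val) do {                   \
                 typeof(_ptr) ptr = (_ptr);                       \
                 *ptr -= min_t(typeof(*ptr), *ptr, _val);         \
         } while (0)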
      
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lore.kernel.org/lkml/20181031184527.GA3178@hirez.programming.kicks-ass.net
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Mask UTIL_AVG_UNCHANGED usages · 92a801e5
      Patrick Bellasi authored
      
      
      _task_util_est() is mainly used to add/remove the task's contribution
      to/from the rq's estimated utilization at task enqueue/dequeue time.
      In both cases we ensure the UTIL_AVG_UNCHANGED flag is set, to keep
      consistency between enqueue and dequeue time while still being
      transparent to update_load_avg() calls, which will eventually reset the
      flag.
      
      Let's move the flag forcing into _task_util_est() itself, so that we
      can simplify the calling code by hiding this estimated-utilization
      implementation detail inside one of its internal functions.

      This also affects the "public" API task_util_est(), but we know that
      the flag will (eventually) impact just the LSB of the estimated
      utilization, so this is certainly acceptable.
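
      Roughly (a sketch; the exact reading of util_est may differ):

         static inline unsigned long _task_util_est(struct task_struct *p)
         {
                 struct util_est ue = READ_ONCE(p->se.avg.util_est);

                 /* force the flag here so callers no longer have to OR it in */
                 return (max(ue.ewma, ue.enqueued) | UTIL_AVG_UNCHANGED);
         }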
      
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20181105145400.935-3-patrick.bellasi@arm.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix cpu_util_wake() for 'execl' type workloads · c469933e
      Patrick Bellasi authored
      A ~10% regression has been reported for UnixBench's execl throughput
      test by Aaron Lu and Ye Xiaolong:
      
        https://lkml.org/lkml/2018/10/30/765
      
      That test is pretty simple: it does a "recursive" execve() syscall on the
      same binary. Starting from the syscall, this sequence is possible:
      
         do_execve()
           do_execveat_common()
             __do_execve_file()
               sched_exec()
                 select_task_rq_fair()          <==| Task already enqueued
                   find_idlest_cpu()
                     find_idlest_group()
               capacity_spare_wake()    <==| Functions not called from
                 cpu_util_wake()           | the wakeup path
      
      which means we can end up calling cpu_util_wake() not only from the
      "wakeup path", as its name would suggest. Indeed, the task doing an
      execve() syscall is already enqueued on the CPU we want to get the
      cpu_util_wake() for.
      
      The estimated utilization for a CPU, as computed in cpu_util_wake(), was
      written under the assumption that the function can only be called from
      the wakeup path. If instead the task is already enqueued, we end up with
      a utilization which does not remove the current task's contribution from
      the estimated utilization of the CPU.
      This wrongly assumes a reduced spare capacity on the current CPU and
      increases the chances of migrating the task on execve.
      
      The regression is tracked down to:
      
       commit d519329f ("sched/fair: Update util_est only on util_avg updates")
      
      because that patch turns on the UTIL_EST sched feature by default.
      However, the real issue was introduced by:

       commit f9be3e59 ("sched/fair: Use util_est in LB and WU paths")
      
      Let's fix this by always discounting the task's estimated utilization
      from the CPU's estimated utilization when the task is also the current
      one. The same benchmark from the bug report, executed on a dual-socket,
      40-CPU Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz machine, reports these
      "Execl Throughput" figures (higher is better):
      
         mainline     : 48136.5 lps
         mainline+fix : 55376.5 lps
      
      which corresponds to a ~15% speedup.
      
      Moreover, since {cpu_util,capacity_spare}_wake() are not really used only
      from the wakeup path, let's remove this ambiguity by using a better
      matching name: {cpu_util,capacity_spare}_without().

      While we are at it, let's also improve the existing documentation.
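
      The core of the fix can be sketched as follows (simplified; the real
      cpu_util_without() also has to handle the util_est case, and
      lsub_positive() is the helper added earlier in this series):

         /* sketch: if p is currently enqueued on this CPU, discount it */
         util = READ_ONCE(cpu_rq(cpu)->cfs.avg.util_avg);
         if (cpu == task_cpu(p) && READ_ONCE(p->se.avg.last_update_time))
                 lsub_positive(&util, task_util(p));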
      
      Reported-by: Aaron Lu <aaron.lu@intel.com>
      Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
      Tested-by: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Fixes: f9be3e59 ("sched/fair: Use util_est in LB and WU paths")
      Link: https://lore.kernel.org/lkml/20181025093100.GB13236@e110439-lin/
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. Nov 05, 2018
  3. Nov 04, 2018
    • sched/core: Introduce set_next_task() helper for better code readability · ff1cdc94
      Muchun Song authored
      
      
      When we pick the next task, we will do the following for the task:
      
        1) p->se.exec_start = rq_clock_task(rq);
        2) dequeue_pushable(_dl)_task(rq, p);
      
      When we call set_curr_task(), we also need to do the same things as
      above. In rt.c, the code at 1) is in _pick_next_task_rt() and the code
      at 2) is in pick_next_task_rt(). It would be better to put the two
      operations in one function, so introduce a new function,
      set_next_task(), which is responsible for doing the above.

      By introducing this function we can stop calling
      dequeue_pushable(_dl)_task() directly in pick_next_task() (calling
      set_next_task() instead) and get better code readability and reuse.
      set_curr_task_rt() can also call set_next_task().

      Do this such that we end up with:
      
        static struct task_struct *pick_next_task(struct rq *rq,
        					    struct task_struct *prev,
        					    struct rq_flags *rf)
        {
        	/* do something else ... */
      
        	put_prev_task(rq, prev);
      
        	/* pick next task p */
      
        	set_next_task(rq, p);
      
        	/* do something else ... */
        }
      
      put_prev_task() now pairs with set_next_task(), which makes the code
      more readable.
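
      For rt.c the new helper then boils down to the two operations listed
      above (a sketch of the rt.c variant):

         static inline void set_next_task(struct rq *rq, struct task_struct *p)
         {
                 p->se.exec_start = rq_clock_task(rq);

                 /* The running task is never eligible for pushing */
                 dequeue_pushable_task(rq, p);
         }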
      
      Signed-off-by: Muchun Song <smuchun@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20181026131743.21786-1-smuchun@gmail.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Don't increase sd->balance_interval on newidle balance · 3f130a37
      Valentin Schneider authored
      When load_balance() fails to move some load because of task affinity,
      we end up increasing sd->balance_interval to delay the next periodic
      balance in the hopes that next time we look, that annoying pinned
      task(s) will be gone.
      
      However, idle_balance() pays no attention to sd->balance_interval, yet
      it will still lead to an increase in balance_interval in case of
      pinned tasks.
      
      If we're going through several newidle balances (e.g. we have a
      periodic task), this can lead to a huge increase of the
      balance_interval in a very small amount of time.
      
      To prevent that, don't increase the balance interval when going
      through a newidle balance.
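
      In load_balance() terms this amounts to bailing out of the interval
      tuning when the balance was a newidle one, e.g. (sketch):

         /* newidle balance should not mess with the balance_interval */
         if (env.idle == CPU_NEWLY_IDLE)
                 goto out;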
      
      This is a similar approach to what is done in commit 58b26c4c
      ("sched: Increment cache_nice_tries only on periodic lb"), where we
      disregard newidle balance and rely on periodic balance for more stable
      results.
      
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar.Eggemann@arm.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: patrick.bellasi@arm.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1537974727-30788-2-git-send-email-valentin.schneider@arm.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Clean up load_balance() condition · 47b7aee1
      Valentin Schneider authored
      
      
      The alignment of the condition is off, clean that up.
      
      Also, logical operators have lower precedence than bitwise/relational
      operators, so remove one layer of parentheses to make the condition a
      bit simpler to follow.
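
      For example, since '&' binds tighter than '&&', the inner parentheses
      in a condition of this shape are redundant (illustrative names only,
      not the exact condition from the patch):

         if ((env.flags & LBF_ALL_PINNED) && other_cond)   /* before */
         if (env.flags & LBF_ALL_PINNED && other_cond)     /* after  */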
      
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar.Eggemann@arm.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: patrick.bellasi@arm.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1537974727-30788-1-git-send-email-valentin.schneider@arm.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Take the hotplug lock in sched_init_smp() · 40fa3780
      Valentin Schneider authored
      When running on linux-next (8c60c36d0b8c ("Add linux-next specific files
      for 20181019")) + CONFIG_PROVE_LOCKING=y on a big.LITTLE system (e.g.
      Juno or HiKey960), we get the following report:
      
       [    0.748225] Call trace:
       [    0.750685]  lockdep_assert_cpus_held+0x30/0x40
       [    0.755236]  static_key_enable_cpuslocked+0x20/0xc8
       [    0.760137]  build_sched_domains+0x1034/0x1108
       [    0.764601]  sched_init_domains+0x68/0x90
       [    0.768628]  sched_init_smp+0x30/0x80
       [    0.772309]  kernel_init_freeable+0x278/0x51c
       [    0.776685]  kernel_init+0x10/0x108
       [    0.780190]  ret_from_fork+0x10/0x18
      
      The static_key in question is 'sched_asym_cpucapacity' introduced by
      commit:
      
        df054e84 ("sched/topology: Add static_key for asymmetric CPU capacity optimizations")
      
      In this particular case, we enable it because smp_prepare_cpus() will
      end up fetching the capacity-dmips-mhz entry from the devicetree,
      so we already have some asymmetry detected when entering sched_init_smp().
      
      This didn't get detected in tip/sched/core because we were missing:

        commit cb538267 ("jump_label/lockdep: Assert we hold the hotplug lock
        for _cpuslocked() operations")
      
      Calls to build_sched_domains() after sched_init_smp() will hold the
      hotplug lock; it just so happens that this very first call is a
      special case. As stated by a comment in sched_init_smp(), "There's no
      userspace yet to cause hotplug operations", so this is a harmless
      warning.
      
      However, to both respect the semantics of underlying
      callees and make lockdep happy, take the hotplug lock in
      sched_init_smp(). This also satisfies the comment atop
      sched_init_domains() that says "Callers must hold the hotplug lock".
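
      The fix is roughly (sketch):

         /* sched_init_smp(): */
         cpus_read_lock();
         mutex_lock(&sched_domains_mutex);
         sched_init_domains(cpu_active_mask);
         mutex_unlock(&sched_domains_mutex);
         cpus_read_unlock();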
      
      Reported-by: Sudeep Holla <sudeep.holla@arm.com>
      Tested-by: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar.Eggemann@arm.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: morten.rasmussen@arm.com
      Cc: quentin.perret@arm.com
      Link: http://lkml.kernel.org/r/1540301851-3048-1-git-send-email-valentin.schneider@arm.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/topology: Fix off by one bug · 993f0b05
      Peter Zijlstra authored
      With the addition of the NUMA identity level, we increased @level by
      one and will run off the end of the array in the distance sort loop.
      
      Fixes: 051f3ca0 ("sched/topology: Introduce NUMA identity node sched domain")
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. Oct 29, 2018
  5. Oct 28, 2018
    • i2c-hid: properly terminate i2c_hid_dmi_desc_override_table[] array · b59dfdae
      Linus Torvalds authored
      Commit 9ee3e066 ("HID: i2c-hid: override HID descriptors for certain
      devices") added a new dmi_system_id quirk table to override certain HID
      report descriptors for some systems that lack them.
      
      But the table wasn't properly terminated, causing the dmi matching to
      walk off into la-la-land, and starting to treat random data as dmi
      descriptor pointers, causing boot-time oopses if you were at all
      unlucky.
      
      Terminate the array.
      
      We really should have some way to just statically check that arrays that
      should be terminated by an empty entry actually are so.  But the HID
      people really should have caught this themselves, rather than have me
      deal with an oops during the merge window.  Tssk, tssk.
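
      The fix itself is just adding the terminating empty entry, i.e. the
      usual pattern for dmi_system_id tables:

         static const struct dmi_system_id i2c_hid_dmi_desc_override_table[] = {
                 /* ... the existing quirk entries ... */
                 { }     /* terminating empty entry */
         };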
      
      Cc: Julian Sax <jsbc@gmx.de>
      Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. Oct 27, 2018
    • Merge branch 'akpm' (patches from Andrew) · 345671ea
      Linus Torvalds authored
      Merge updates from Andrew Morton:
      
       - a few misc things
      
       - ocfs2 updates
      
       - most of MM
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (132 commits)
        hugetlbfs: dirty pages as they are added to pagecache
        mm: export add_swap_extent()
        mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
        tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE
        mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
        mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
        mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
        mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t
        mm/gup: cache dev_pagemap while pinning pages
        Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
        mm: return zero_resv_unavail optimization
        mm: zero remaining unavailable struct pages
        tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option
        tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option
        tools/testing/selftests/vm/gup_benchmark.c: allow user specified file
        tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage
        mm/gup_benchmark.c: add additional pinning methods
        mm/gup_benchmark.c: time put_page()
        mm: don't raise MEMCG_OOM event due to failed high-order allocation
        mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
        ...
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 49040081
      Linus Torvalds authored
      Pull networking fixes from David Miller:
       "What better way to start off a weekend than with some networking bug
        fixes:
      
        1) net namespace leak in dump filtering code of ipv4 and ipv6, fixed
           by David Ahern and Bjørn Mork.
      
        2) Handle bad checksums from hardware when using CHECKSUM_COMPLETE
           properly in UDP, from Sean Tranchetti.
      
        3) Remove TCA_OPTIONS from policy validation, it turns out we don't
           consistently use nested attributes for this across all packet
           schedulers. From David Ahern.
      
        4) Fix SKB corruption in cadence driver, from Tristram Ha.
      
        5) Fix broken WoL handling in r8169 driver, from Heiner Kallweit.
      
        6) Fix OOPS in pneigh_dump_table(), from Eric Dumazet"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (28 commits)
        net/neigh: fix NULL deref in pneigh_dump_table()
        net: allow traceroute with a specified interface in a vrf
        bridge: do not add port to router list when receives query with source 0.0.0.0
        net/smc: fix smc_buf_unuse to use the lgr pointer
        ipv6/ndisc: Preserve IPv6 control buffer if protocol error handlers are called
        net/{ipv4,ipv6}: Do not put target net if input nsid is invalid
        lan743x: Remove SPI dependency from Microchip group.
        drivers: net: remove <net/busy_poll.h> inclusion when not needed
        net: phy: genphy_10g_driver: Avoid NULL pointer dereference
        r8169: fix broken Wake-on-LAN from S5 (poweroff)
        octeontx2-af: Use GFP_ATOMIC under spin lock
        net: ethernet: cadence: fix socket buffer corruption problem
        net/ipv6: Allow onlink routes to have a device mismatch if it is the default route
        net: sched: Remove TCA_OPTIONS from policy
        ice: Poll for link status change
        ice: Allocate VF interrupts and set queue map
        ice: Introduce ice_dev_onetime_setup
        net: hns3: Fix for warning uninitialized symbol hw_err_lst3
        octeontx2-af: Copy the right amount of memory
        net: udp: fix handling of CHECKSUM_COMPLETE packets
        ...
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc · a45dcff7
      Linus Torvalds authored
      Pull sparc fixes from David Miller:
       "Some more sparc fixups, mostly aimed at getting the allmodconfig build
        up and clean again"
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
        sparc64: Rework xchg() definition to avoid warnings.
        sparc64: Export __node_distance.
        sparc64: Make corrupted user stacks more debuggable.
    • hugetlbfs: dirty pages as they are added to pagecache · 22146c3c
      Mike Kravetz authored
      Some test systems were experiencing negative huge page reserve counts and
      incorrect file block counts.  This was traced to /proc/sys/vm/drop_caches
      removing clean pages from hugetlbfs file pagecaches.  When non-hugetlbfs
      explicit code removes the pages, the appropriate accounting is not
      performed.
      
      This can be recreated as follows:
       fallocate -l 2M /dev/hugepages/foo
       echo 1 > /proc/sys/vm/drop_caches
       fallocate -l 2M /dev/hugepages/foo
       grep -i huge /proc/meminfo
         AnonHugePages:         0 kB
         ShmemHugePages:        0 kB
         HugePages_Total:    2048
         HugePages_Free:     2047
         HugePages_Rsvd:    18446744073709551615
         HugePages_Surp:        0
         Hugepagesize:       2048 kB
         Hugetlb:         4194304 kB
       ls -lsh /dev/hugepages/foo
         4.0M -rw-r--r--. 1 root root 2.0M Oct 17 20:05 /dev/hugepages/foo
      
      To address this issue, dirty pages as they are added to pagecache.  This
      can easily be reproduced with fallocate as shown above.  Read faulted
      pages will eventually end up being marked dirty.  But there is a window
      where they are clean and could be impacted by code such as drop_caches.
      So, just dirty them all as they are added to the pagecache.
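
      A sketch of the fix: mark the page dirty right where hugetlbfs inserts
      it into the pagecache, so drop_caches never sees it as clean and
      discardable (exact function and locking elided):

         err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);
         if (!err)
                 set_page_dirty(page);  /* dirty as soon as it is in the pagecache */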
      
      Link: http://lkml.kernel.org/r/b5be45b8-5afe-56cd-9482-28384699a049@oracle.com
      Fixes: 6bda666a ("hugepages: fold find_or_alloc_pages into huge_no_page()")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: export add_swap_extent() · aa8aa8a3
      Omar Sandoval authored
      Btrfs currently does not support swap files because swap's use of bmap
      does not work with copy-on-write and multiple devices.  See 35054394
      ("Btrfs: stop providing a bmap operation to avoid swapfile corruptions").
      
      However, the swap code has a mechanism for the filesystem to manually add
      swap extents using add_swap_extent() from the ->swap_activate() aop.
      iomap has done this since 67482129 ("iomap: add a swapfile activation
      function").  Btrfs will do the same in a later patch, so export
      add_swap_extent().
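
      The change itself is just the export next to the function's definition
      in mm/swapfile.c (a sketch; whether the export is GPL-only is recalled
      from memory rather than quoted from the patch):

         EXPORT_SYMBOL_GPL(add_swap_extent);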
      
      Link: http://lkml.kernel.org/r/bb1208575e02829aae51b538709476964f97b1ea.1536704650.git.osandov@fb.com
      
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS · bc4ae27d
      Omar Sandoval authored
      The SWP_FILE flag serves two purposes: to make swap_{read,write}page() go
      through the filesystem, and to make swapoff() call ->swap_deactivate().
      For Btrfs, we want the latter but not the former, so split this flag into
      two.  This makes us always call ->swap_deactivate() if ->swap_activate()
      succeeded, not just if it didn't add any swap extents itself.
      
      This also resolves the issue of the very misleading name of SWP_FILE,
      which is only used for swap files over NFS.
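
      Conceptually (bit values illustrative, not necessarily the exact ones):

         SWP_ACTIVATED = (1 << 7), /* ->swap_activate() succeeded */
         SWP_FS        = (1 << 8), /* swap I/O must go through the filesystem */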
      
      Link: http://lkml.kernel.org/r/6d63d8668c4287a4f6d203d65696e96f80abdfc7.1536704650.git.osandov@fb.com
      
      
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE · 91cbacc3
      Michael Ellerman authored
      Add a test for MAP_FIXED_NOREPLACE, based on some code originally by Jann
      Horn.  This would have caught the overlap bug reported by Daniel Micay.
      
      I originally suggested to Michal that we create MAP_FIXED_NOREPLACE, but
      instead of writing a selftest I spent my time bike-shedding whether it
      should be called MAP_FIXED_SAFE/NOCLOBBER/WEAK/NEW ..  mea culpa.
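
      The behaviour the test exercises is, in essence (userspace sketch):

         /* an overlapping MAP_FIXED_NOREPLACE request must fail with EEXIST
          * instead of silently clobbering the existing mapping */
         p = mmap(addr, size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
         if (p == MAP_FAILED && errno == EEXIST)
                 /* expected: [addr, addr + size) overlaps an existing mapping */;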
      
      Link: http://lkml.kernel.org/r/20181013133929.28653-1-mpe@ellerman.id.au
      
      
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Joel Stanley <joel@jms.id.au>
      Cc: Jason Evans <jasone@google.com>
      Cc: David Goldblatt <davidtgoldblatt@gmail.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page() · 7eef5f97
      Andrea Arcangeli authored
      There should be no cache left by the time we overwrite the old transhuge
      pmd with the new one.  It's already too late to flush through the virtual
      address because we already copied the page data to the new physical
      address.
      
      So flush the cache before the data copy.
      
      Also delete the "end" variable to shut off an "unused variable" warning
      on x86, where flush_cache_range() is a noop.
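
      In other words, the ordering becomes (sketch):

         /* flush the old mapping's cache before the data is copied ... */
         flush_cache_range(vma, start, start + HPAGE_PMD_SIZE);
         /* ... not after migrate_page_copy(), when it would be too late */
         migrate_page_copy(new_page, page);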
      
      Link: http://lkml.kernel.org/r/20181015202311.7209-1-aarcange@redhat.com
      
      
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page() · 7066f0f9
      Andrea Arcangeli authored
      change_huge_pmd() after arming the numa/protnone pmd doesn't flush the TLB
      right away.  do_huge_pmd_numa_page() flushes the TLB before calling
      migrate_misplaced_transhuge_page().  By the time do_huge_pmd_numa_page()
      runs some CPU could still access the page through the TLB.
      
      change_huge_pmd() before arming the numa/protnone transhuge pmd calls
      mmu_notifier_invalidate_range_start().  So there's no need of
      mmu_notifier_invalidate_range_start()/mmu_notifier_invalidate_range_only_end()
      sequence in migrate_misplaced_transhuge_page() too, because by the time
      migrate_misplaced_transhuge_page() runs, the pmd mapping has already been
      invalidated in the secondary MMUs.  It has to or if a secondary MMU can
      still write to the page, the migrate_page_copy() would lose data.
      
      However an explicit mmu_notifier_invalidate_range() is needed before
      migrate_misplaced_transhuge_page() starts copying the data of the
      transhuge page or the below can happen for MMU notifier users sharing the
      primary MMU pagetables and only implementing ->invalidate_range:
      
      CPU0		CPU1		GPU sharing linux pagetables using
                                      only ->invalidate_range
      -----------	------------	---------
      				GPU secondary MMU writes to the page
      				mapped by the transhuge pmd
      change_pmd_range()
      mmu..._range_start()
      ->invalidate_range_start() noop
      change_huge_pmd()
      set_pmd_at(numa/protnone)
      pmd_unlock()
      		do_huge_pmd_numa_page()
      		CPU TLB flush globally (1)
      		CPU cannot write to page
      		migrate_misplaced_transhuge_page()
      				GPU writes to the page...
      		migrate_page_copy()
      				...GPU stops writing to the page
      CPU TLB flush (2)
      mmu..._range_end() (3)
      ->invalidate_range_end() noop
      ->invalidate_range()
      				GPU secondary MMU is invalidated
      				and cannot write to the page anymore
      				(too late)
      
      Just like we need a CPU TLB flush (1) because the TLB flush (2) arrives
      too late, we also need a mmu_notifier_invalidate_range() before calling
      migrate_misplaced_transhuge_page(), because the ->invalidate_range() in
      (3) also arrives too late.
      
      This requirement is the result of the lazy optimization in
      change_huge_pmd() that releases the pmd_lock without first flushing the
      TLB and without first calling mmu_notifier_invalidate_range().
      
      Even converting the removed mmu_notifier_invalidate_range_only_end()
      into a mmu_notifier_invalidate_range_end() would not have been enough
      to fix this, because it runs after migrate_page_copy().
      
      After the hugepage data copy is done migrate_misplaced_transhuge_page()
      can proceed and call set_pmd_at without having to flush the TLB nor any
      secondary MMUs because the secondary MMU invalidate, just like the CPU TLB
      flush, has to happen before the migrate_page_copy() is called or it would
      be a bug in the first place (and it was for drivers using
      ->invalidate_range()).
      
      KVM is unaffected because it doesn't implement ->invalidate_range().
      
      The standard PAGE_SIZEd migrate_misplaced_page is less accelerated and
      uses the generic migrate_pages which transitions the pte from
      numa/protnone to a migration entry in try_to_unmap_one() and flushes TLBs
      and all mmu notifiers there before copying the page.
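
      The required ordering can be sketched as:

         /* make sure secondary MMUs implementing only ->invalidate_range()
          * have dropped their mapping before the data copy starts */
         mmu_notifier_invalidate_range(mm, start, start + HPAGE_PMD_SIZE);
         migrate_page_copy(new_page, page);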
      
      Link: http://lkml.kernel.org/r/20181013002430.698-3-aarcange@redhat.com
      
      
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition · d7c33934
      Andrea Arcangeli authored
      Patch series "migrate_misplaced_transhuge_page race conditions".
      
      Aaron found a new instance of the THP MADV_DONTNEED race against
      pmdp_clear_flush* variants, that was apparently left unfixed.
      
      While looking into the race found by Aaron, I may have found two more
      issues in migrate_misplaced_transhuge_page.
      
      These race conditions would not cause kernel instability, but they'd
      corrupt userland data or leave data non-zero after MADV_DONTNEED.
      
      I did only minor testing, and I don't expect to be able to reproduce this
      (especially the lack of ->invalidate_range before migrate_page_copy,
      requires the latest iommu hardware or infiniband to reproduce).  The last
      patch is noop for x86 and it needs further review from maintainers of
      archs that implement flush_cache_range() (not in CC yet).
      
      To avoid confusion: it's not the first patch that introduces the bug
      fixed in the second patch; even before the removal of
      pmdp_huge_clear_flush_notify(), that _notify variant was called only
      after migrate_page_copy() had already run.
      
      This patch (of 3):
      
      This is a corollary of ced10803 ("thp: fix MADV_DONTNEED vs.  numa
      balancing race"), 58ceeb6b ("thp: fix MADV_DONTNEED vs.  MADV_FREE
      race") and 5b7abeae ("thp: fix MADV_DONTNEED vs clear soft dirty
      race").
      
      When the above three fixes were posted, Dave asked
      https://lkml.kernel.org/r/929b3844-aec2-0111-fef7-8002f9d4e2b9@intel.com
      but apparently this was missed.
      
      The pmdp_clear_flush* in migrate_misplaced_transhuge_page() was introduced
      in a54a407f ("mm: Close races between THP migration and PMD numa
      clearing").
      
      The important part of such commit is only the part where the page lock is
      not released until the first do_huge_pmd_numa_page() finished disarming
      the pagenuma/protnone.
      
      The addition of pmdp_clear_flush() wasn't beneficial to such commit and
      there's no commentary about such an addition either.
      
      I guess the pmdp_clear_flush() in such commit was added just in case for
      safety, but it ended up introducing the MADV_DONTNEED race condition found
      by Aaron.
      
      At that point in time nobody thought of such kind of MADV_DONTNEED race
      conditions yet (they were fixed later) so the code may have looked more
      robust by adding the pmdp_clear_flush().
      
      This specific race condition won't destabilize the kernel, but it can
      confuse userland because after MADV_DONTNEED the memory won't be zeroed
      out.
      
      This also optimizes the code and removes a superfluous TLB flush.
      
      [akpm@linux-foundation.org: reflow comment to 80 cols, fix grammar and typo (beacuse)]
      Link: http://lkml.kernel.org/r/20181013002430.698-2-aarcange@redhat.com
      
      
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Aaron Tomlin <atomlin@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t · 026d1eaf
      Clark Williams authored
      The static lock quarantine_lock is used in quarantine.c to protect the
      quarantine queue datastructures.  It is taken inside quarantine queue
      manipulation routines (quarantine_put(), quarantine_reduce() and
      quarantine_remove_cache()), with IRQs disabled.  This is not a problem on
      a stock kernel, but it is problematic on an RT kernel, where spinlocks are
      sleeping locks and cannot be acquired with interrupts disabled.

      Convert quarantine_lock to a raw_spinlock_t.  The usage of
      quarantine_lock is confined to quarantine.c, and the work performed while
      the lock is held is for debug purposes only.
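
      The conversion is mechanical (sketch):

         static DEFINE_RAW_SPINLOCK(quarantine_lock);

         /* and, in the queue manipulation routines: */
         raw_spin_lock_irqsave(&quarantine_lock, flags);
         /* ... manipulate the quarantine queues ... */
         raw_spin_unlock_irqrestore(&quarantine_lock, flags);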
      
      [bigeasy@linutronix.de: slightly altered the commit message]
      Link: http://lkml.kernel.org/r/20181010214945.5owshc3mlrh74z4b@linutronix.de
      
      
      Signed-off-by: Clark Williams <williams@redhat.com>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Acked-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: cache dev_pagemap while pinning pages · df06b37f
      Keith Busch authored
      Getting pages from ZONE_DEVICE memory needs to check the backing device's
      live-ness, which is tracked in the device's dev_pagemap metadata.  This
      metadata is stored in a radix tree and looking it up adds measurable
      software overhead.
      
      This patch avoids repeating this relatively costly operation when
      dev_pagemap is used by caching the last dev_pagemap while getting user
      pages.  The gup_benchmark kernel self test reports this reduces time to
      get user pages to as low as 1/3 of the previous time.
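
      The caching pattern is roughly the following (sketch; it relies on
      get_dev_pagemap() accepting a previously returned pgmap and reusing it
      when the pfn still falls inside it):

         struct dev_pagemap *pgmap = NULL;
         unsigned long pfn;

         for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                 /* cheap when pfn is still covered by the cached pgmap */
                 pgmap = get_dev_pagemap(pfn, pgmap);
                 if (!pgmap)
                         break;  /* the ZONE_DEVICE backing has gone away */
                 /* ... pin the page ... */
         }
         if (pgmap)
                 put_dev_pagemap(pgmap);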
      
      Link: http://lkml.kernel.org/r/20181012173040.15669-1-keith.busch@intel.com
      
      
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved" · 9fd61bc9
      Masayoshi Mizuma authored
      Commit 124049de ("x86/e820: put !E820_TYPE_RAM regions into
      memblock.reserved") breaks the movable_node kernel option because it
      changed the memory gap range to reserved memblock.  So the node is
      marked as a Normal zone even if the SRAT has hot-pluggable affinity.
      
          =====================================================================
          kernel: BIOS-e820: [mem 0x0000180000000000-0x0000180fffffffff] usable
          kernel: BIOS-e820: [mem 0x00001c0000000000-0x00001c0fffffffff] usable
          ...
          kernel: reserved[0x12]#011[0x0000181000000000-0x00001bffffffffff], 0x000003f000000000 bytes flags: 0x0
          ...
          kernel: ACPI: SRAT: Node 2 PXM 6 [mem 0x180000000000-0x1bffffffffff] hotplug
          kernel: ACPI: SRAT: Node 3 PXM 7 [mem 0x1c0000000000-0x1fffffffffff] hotplug
          ...
          kernel: Movable zone start for each node
          kernel:  Node 3: 0x00001c0000000000
          kernel: Early memory node ranges
          ...
          =====================================================================
      
      The original issue is fixed by the former patches, so let's revert commit
      124049de ("x86/e820: put !E820_TYPE_RAM regions into
      memblock.reserved").
      
      Link: http://lkml.kernel.org/r/20181002143821.5112-4-msys.mizuma@gmail.com
      
      
      Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: return zero_resv_unavail optimization · ec393a0f
      Pavel Tatashin authored
      When checking for valid pfns in zero_resv_unavail(), it is not necessary
      to verify that every pfn within a pageblock_nr_pages range is valid; only
      the first one needs to be checked.  This is because memory for struct
      pages is allocated in contiguous chunks that contain pageblock_nr_pages
      struct pages.
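
      A sketch of the resulting loop (only the first pfn of each pageblock
      gets the pfn_valid() check; details may differ from the patch):

         for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
                 if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages)))
                         continue;
                 mm_zero_struct_page(pfn_to_page(pfn));
         }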
      
      Link: http://lkml.kernel.org/r/20181002143821.5112-3-msys.mizuma@gmail.com
      
      
      Signed-off-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: zero remaining unavailable struct pages · 907ec5fc
      Naoya Horiguchi authored
      Patch series "mm: Fix for movable_node boot option", v3.
      
      This patch series contains a fix for the movable_node boot option issue
      which was introduced by commit 124049de ("x86/e820: put !E820_TYPE_RAM
      regions into memblock.reserved").
      
      The commit breaks the option because it changed the memory gap range to
      reserved memblock.  So the node is marked as a Normal zone even if the
      SRAT has hot-pluggable affinity.

      The first and second patches fix the original issue which the commit
      tried to address; the third patch reverts the commit.
      
      This patch (of 3):
      
      There is a kernel panic that is triggered when reading /proc/kpageflags on
      the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
      
        BUG: unable to handle kernel paging request at fffffffffffffffe
        PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
        Oops: 0000 [#1] SMP PTI
        CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
        RIP: 0010:stable_page_flags+0x27/0x3c0
        Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
        RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
        RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
        RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
        R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
        R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
        FS:  00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
        Call Trace:
         kpageflags_read+0xc7/0x120
         proc_reg_read+0x3c/0x60
         __vfs_read+0x36/0x170
         vfs_read+0x89/0x130
         ksys_pread64+0x71/0x90
         do_syscall_64+0x5b/0x160
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7efc42e75e23
        Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
      
      According to kernel bisection, this problem became visible due to commit
      f7f99100 which changes how struct pages are initialized.
      
      Memblock layout affects the pfn ranges covered by node/zone.  Consider
      that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the
      default (no memmap= given) memblock layout is like below:
      
        MEMBLOCK configuration:
         memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x4
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
         memory[0x3]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      If you give memmap=1G!4G (so it just covers memory[0x2]),
      the range [0x100000000-0x13fffffff] is gone:
      
        MEMBLOCK configuration:
         memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x3
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      This shrinks node 0's pfn range because it is calculated from the
      address range of memblock.memory.  So some of the struct pages in the
      gap range are left uninitialized.

      We have a function zero_resv_unavail() which zeroes the struct pages
      outside memblock.memory, but currently it covers only the reserved
      unavailable range (i.e.  memblock.memory && !memblock.reserved).  This
      patch extends it to cover the whole unavailable range, which fixes the
      reported issue.
      
      Link: http://lkml.kernel.org/r/20181002143821.5112-2-msys.mizuma@gmail.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Tested-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option · 3821b76c
      Keith Busch authored
      Add a new option '-H' to the gup benchmark to help understand how mapping
      hugetlb pages compares with the default.
      
      Link: http://lkml.kernel.org/r/20181010195605.10689-6-keith.busch@intel.com
      
      
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option · 0dd8666a
      Keith Busch authored
      Add a new benchmark option, -S, to request MAP_SHARED.  This can be used
      to compare with MAP_PRIVATE, or for files that require this option, like
      dax.
      
      Link: http://lkml.kernel.org/r/20181010195605.10689-5-keith.busch@intel.com
      
      
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tools/testing/selftests/vm/gup_benchmark.c: allow user specified file · aeb85ed4
      Keith Busch authored
      Allow a user to specify a file to map by adding a new option, '-f',
      providing a means to test various file backings.
      
      If not specified, the benchmark will use a private mapping of /dev/zero,
      which produces an anonymous mapping as before.
      
      [akpm@linux-foundation.org: avoid using comma operator]
      Link: http://lkml.kernel.org/r/20181010195605.10689-4-keith.busch@intel.com
      
      
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage · 319e0bec
      Keith Busch authored
      If the '-w' parameter was provided, the benchmark would exit due to a
      missing 'break'.
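
      I.e. the option-parsing switch needs a break for the 'w' case (sketch):

         case 'w':
                 write = 1;
                 break;  /* previously fell through into the error/usage path */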
      
      Link: http://lkml.kernel.org/r/20181010195605.10689-3-keith.busch@intel.com
      
      
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup_benchmark.c: add additional pinning methods · 714a3a1e
      Keith Busch authored
      Provide new gup benchmark ioctl commands to run different user page
      pinning methods, get_user_pages_longterm() and get_user_pages(), in
      addition to the existing get_user_pages_fast().
      
      Link: http://lkml.kernel.org/r/20181010195605.10689-2-keith.busch@intel.com
      
      
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup_benchmark.c: time put_page() · 26db3d09
      Keith Busch authored
      We'd like to measure time to unpin user pages, so this adds a second
      benchmark timer on put_page, separate from get_page.
      
      Adding the field breaks this ioctl ABI, but that should be okay since
      this is an in-tree kernel selftest.
      
      [akpm@linux-foundation.org: add expansion to struct gup_benchmark for future use]
      Link: http://lkml.kernel.org/r/20181010195605.10689-1-keith.busch@intel.com
      
      
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: don't raise MEMCG_OOM event due to failed high-order allocation · 7a1adfdd
      Roman Gushchin authored
      It was reported that on some of our machines containers were restarted
      with OOM symptoms without an obvious reason.  Despite there being almost
      no memory pressure and plenty of page cache, the MEMCG_OOM event was
      raised occasionally, causing the container management software to think
      that an OOM had happened.  However, no tasks had been killed.
      
      The following investigation showed that the problem is caused by a failing
      attempt to charge a high-order page.  In such a case, the OOM killer is
      never invoked.  As shown below, it can happen under conditions which are
      very far from a real OOM: e.g.  there is plenty of clean page cache and no
      memory pressure.
      
      There is no sense in raising an OOM event in this case, as it might
      confuse a user and lead to wrong and excessive actions (e.g.  restart the
      workload, as in my case).
      
      Let's look at the charging path in try_charge().  If the memory usage is
      about memory.max, which is absolutely natural for most memory cgroups, we
      try to reclaim some pages.  Even if we were able to reclaim enough memory
      for the allocation, the following check can fail due to a race with
      another concurrent allocation:
      
          if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
              goto retry;
      
      For regular pages the following condition will save us from triggering
      the OOM:
      
         if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
             goto retry;
      
      But for a high-order allocation this condition will intentionally fail.
      The reasoning is that we'll likely fall back to regular pages anyway, so
      it's ok and even preferred to return ENOMEM.
      
      In this case the idea of raising MEMCG_OOM looks dubious.
      
      Fix this by moving the MEMCG_OOM raising into mem_cgroup_oom(), after the
      allocation order check, so that the event won't be raised for high-order
      allocations.  This change doesn't affect regular page allocation and
      charging.
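
      The shape of the fix in mem_cgroup_oom() is roughly the sketch below
      (simplified; the surrounding OOM handling is elided): the event is
      raised only once the order check has confirmed the allocation is
      eligible for the OOM killer.

          /*
           * The OOM killer is never invoked for costly orders, so a failed
           * high-order charge is not a real OOM condition; don't report it
           * as one.
           */
          if (order > PAGE_ALLOC_COSTLY_ORDER)
                  return OOM_SKIPPED;

          /* Only now is it meaningful to notify userspace. */
          memcg_memory_event(memcg, MEMCG_OOM);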
      
      Link: http://lkml.kernel.org/r/20181004214050.7417-1-guro@fb.com
      
      
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a1adfdd
    • Dave Chinner's avatar
      mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock · 64081362
      Dave Chinner authored
      We've recently seen a workload on XFS filesystems with a repeatable
      deadlock between background writeback and a multi-process application
      doing concurrent writes and fsyncs to a small range of a file.
      
      range_cyclic
      writeback		Process 1		Process 2
      
      xfs_vm_writepages
        write_cache_pages
          writeback_index = 2
          cycled = 0
          ....
          find page 2 dirty
          lock Page 2
          ->writepage
            page 2 writeback
            page 2 clean
            page 2 added to bio
          no more pages
      			write()
      			locks page 1
      			dirties page 1
      			locks page 2
      			dirties page 2
      			fsync()
      			....
      			xfs_vm_writepages
      			write_cache_pages
      			  start index 0
      			  find page 1 towrite
      			  lock Page 1
      			  ->writepage
      			    page 1 writeback
      			    page 1 clean
      			    page 1 added to bio
      			  find page 2 towrite
      			  lock Page 2
      			  page 2 is writeback
      			  <blocks>
      						write()
      						locks page 1
      						dirties page 1
      						fsync()
      						....
      						xfs_vm_writepages
      						write_cache_pages
      						  start index 0
      
          !done && !cycled
            sets index to 0, restarts lookup
          find page 1 dirty
      						  find page 1 towrite
      						  lock Page 1
      						  page 1 is writeback
      						  <blocks>
      
          lock Page 1
          <blocks>
      
      DEADLOCK because:
      
      	- process 1 needs page 2 writeback to complete to make
      	  enough progress to issue IO pending for page 1
      	- writeback needs page 1 writeback to complete so process 2
      	  can progress and unlock the page it is blocked on, then it
      	  can issue the IO pending for page 2
      	- process 2 can't make progress until process 1 issues IO
      	  for page 1
      
      The underlying cause of the problem here is that range_cyclic writeback
      ends up processing pages in descending index order, because we hold
      higher-index pages in a structure controlled from above
      write_cache_pages().  The write_cache_pages() caller needs to be able to
      submit these pages for IO before write_cache_pages() restarts writeback
      at mapping index 0, otherwise wcp inverts the page lock/writeback wait
      order.
      
      generic_writepages() is not susceptible to this bug as it has no private
      context held across write_cache_pages() - filesystems using this
      infrastructure always submit pages in ->writepage immediately and so there
      is no problem with range_cyclic going back to mapping index 0.
      
      However:
      	mpage_writepages() has a private bio context,
      	exofs_writepages() has page_collect
      	fuse_writepages() has fuse_fill_wb_data
      	nfs_writepages() has nfs_pageio_descriptor
      	xfs_vm_writepages() has xfs_writepage_ctx
      
      All of these ->writepages implementations can hold pages under writeback
      in their private structures until write_cache_pages() returns, and hence
      they are all susceptible to this deadlock.
      
      Also worth noting is that ext4 has its own bastardised version of
      write_cache_pages() and so it /may/ have an equivalent deadlock.  I
      looked at the code long enough to understand that it has a similar retry
      loop for range_cyclic writeback reaching the end of the file, and then
      promptly ran away before my eyes bled too much.  I'll leave it to the
      ext4 developers to determine whether their code actually has this
      deadlock and how to fix it if it does.
      
      There are a few ways I can see to avoid this deadlock.  There are
      probably more, but these are the first I've thought of:
      
      1. get rid of range_cyclic altogether
      
      2. range_cyclic always stops at EOF, and we start again from
      writeback index 0 on the next call into write_cache_pages()
      
      2a. wcp also returns EAGAIN to ->writepages implementations to
      indicate range_cyclic has hit EOF. The writepages implementations can
      then flush the current context and call wcp again to continue, i.e.
      lift the retry into the ->writepages implementation
      
      3. range_cyclic uses trylock_page() rather than lock_page(), and it
      skips pages it can't lock without blocking. It will already do this
      for pages under writeback, so this seems like a no-brainer
      
      3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid
      blocking as per pages under writeback.
      
      I don't think #1 is an option - range_cyclic prevents frequently
      dirtied lower file offsets from starving background writeback of
      rarely touched higher file offsets.
      
      #2 is simple, and I don't think it will have any impact on
      performance as going back to the start of the file implies an
      immediate seek. We'll have exactly the same number of seeks if we
      switch writeback to another inode, and then come back to this one
      later and restart from index 0.
      
      #2a is pretty much "status quo without the deadlock". Moving the
      retry loop up into the wcp caller means we can issue IO on the
      pending pages before calling wcp again, and so avoid locking or
      waiting on pages in the wrong order. I'm not convinced we need to do
      this given that we get the same thing from #2 on the next writeback
      call from the writeback infrastructure.
      
      #3 is really just a band-aid - it doesn't fix the access/wait
      inversion problem, just prevents it from becoming a deadlock
      situation. I'd prefer we fix the inversion, not sweep it under the
      carpet like this.
      
      #3a is really an optimisation that just so happens to include the
      band-aid fix of #3.
      
      So it seems that the simplest way to fix this issue is to implement
      solution #2.
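
      A simplified sketch of what solution #2 looks like inside
      write_cache_pages(): drop the in-call wrap-around and instead reset the
      stored writeback index, so the next range_cyclic invocation starts again
      from index 0.  This illustrates the approach, not the literal patch;
      the details of the page walk are elided.

          if (wbc->range_cyclic) {
                  index = mapping->writeback_index; /* prev offset */
                  end = -1;       /* sweep to EOF, no wrap in this call */
          }

          while (!done && index <= end) {
                  /* tag, lock and ->writepage pages as before */
          }

          /*
           * We hit EOF with work left to do: remember to start the next
           * range_cyclic pass from index 0 rather than looping back here
           * while the caller still holds pages under writeback.
           */
          if (wbc->range_cyclic && !done)
                  done_index = 0;
          if (wbc->range_cyclic || (range_whole && wbc->nr_to_write > 0))
                  mapping->writeback_index = done_index;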
      
      Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.com
      
      
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.de>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64081362
    • Pavel Tatashin's avatar
      mm: move mirrored memory specific code outside of memmap_init_zone · a9a9e77f
      Pavel Tatashin authored
      memmap_init_zone() is getting complex, because it is called from
      different contexts (hotplug and boot) and also because it must handle
      some architecture quirks.  One of them is mirrored memory.
      
      Move the code that decides whether to skip mirrored memory out of
      memmap_init_zone() and into a separate function.
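
      The separated-out helper might look like the sketch below, assuming its
      job is to report (and skip past) mirrored memblock regions for
      ZONE_MOVABLE when mirrored kernelcore is in use; the names approximate
      the upstream helper but are not guaranteed to match it exactly.

          /*
           * Return true if *pfn falls in a mirrored region that the memmap
           * init loop should skip; advance *pfn past that region.
           */
          static bool __meminit
          overlap_memmap_init(unsigned long zone, unsigned long *pfn)
          {
                  static struct memblock_region *r;

                  if (mirrored_kernelcore && zone == ZONE_MOVABLE) {
                          if (!r || *pfn >= memblock_region_memory_end_pfn(r)) {
                                  for_each_memblock(memory, r) {
                                          if (*pfn < memblock_region_memory_end_pfn(r))
                                                  break;
                                  }
                          }
                          if (*pfn >= memblock_region_memory_base_pfn(r) &&
                              memblock_is_mirror(r)) {
                                  *pfn = memblock_region_memory_end_pfn(r);
                                  return true;
                          }
                  }
                  return false;
          }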
      
      [pasha.tatashin@oracle.com: uninline overlap_memmap_init()]
        Link: http://lkml.kernel.org/r/20180726193509.3326-4-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20180724235520.10200-4-pasha.tatashin@oracle.com
      
      
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9a9e77f
    • Pavel Tatashin's avatar
      mm: calculate deferred pages after skipping mirrored memory · d3035be4
      Pavel Tatashin authored
      update_defer_init() should be called only when a struct page is about to
      be initialized, because it counts the number of initialized struct
      pages; but we may skip struct pages when there is mirrored memory.
      
      So move update_defer_init() after the check for mirrored memory.
      
      Also, rename update_defer_init() to defer_init() and invert the returned
      boolean to emphasize that this is a boolean function that tells whether
      the rest of the memmap initialization should be deferred.
      
      Make this function self-contained: instead of passing in the number of
      already initialized pages in this zone, keep track of it with static
      counters.
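
      A sketch of the resulting self-contained helper, keeping its progress in
      static counters rather than taking a count from the caller; the exact
      threshold and pgdat field names are assumptions based on the deferred
      struct-page-init machinery, not necessarily the final code.

          /*
           * Returns true once the rest of this zone's memmap initialization
           * should be deferred to the on-demand path.  Uses static state, so
           * it is only safe from the single-threaded early boot path.
           */
          static inline bool __meminit
          defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
          {
                  static unsigned long prev_end_pfn, nr_initialised;

                  /* Starting a new zone: reset the running counter. */
                  if (prev_end_pfn != end_pfn) {
                          prev_end_pfn = end_pfn;
                          nr_initialised = 0;
                  }

                  nr_initialised++;
                  if (nr_initialised > NODE_DATA(nid)->static_init_pgcnt &&
                      (pfn & (PAGES_PER_SECTION - 1)) == 0) {
                          NODE_DATA(nid)->first_deferred_pfn = pfn;
                          return true;
                  }
                  return false;
          }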
      
      I found this bug by reading the code.  The effect is that fewer struct
      pages than expected are initialized early in boot, and it is possible
      that in some corner cases we may fail to boot when mirrored pages are
      used.  The deferred on-demand code should somewhat mitigate this, but it
      still brings some inconsistencies compared to booting without mirrored
      pages, so it is better to fix it.
      
      [pasha.tatashin@oracle.com: add comment about defer_init's lack of locking]
        Link: http://lkml.kernel.org/r/20180726193509.3326-3-pasha.tatashin@oracle.com
      [akpm@linux-foundation.org: make defer_init non-inline, __meminit]
      Link: http://lkml.kernel.org/r/20180724235520.10200-3-pasha.tatashin@oracle.com
      
      
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d3035be4
    • Pavel Tatashin's avatar
      mm: make memmap_init a proper function · dfb3ccd0
      Pavel Tatashin authored
      memmap_init is sometimes a macro and sometimes a function, depending on
      __HAVE_ARCH_MEMMAP_INIT; it is a function only on ia64.  Make memmap_init
      a weak function instead, and let ia64 redefine it.
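
      The weak-default pattern looks roughly like the sketch below: the
      generic definition is marked __weak, so an architecture that provides
      its own strong memmap_init() simply overrides it at link time, with no
      #ifdef needed.  The argument list here follows memmap_init_zone() as it
      looked around this time and is an approximation.

          /* Generic default; ia64 supplies its own strong definition. */
          void __meminit __weak memmap_init(unsigned long size, int nid,
                                            unsigned long zone,
                                            unsigned long start_pfn)
          {
                  memmap_init_zone(size, nid, zone, start_pfn,
                                   MEMMAP_EARLY, NULL);
          }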
      
      Link: http://lkml.kernel.org/r/20180724235520.10200-2-pasha.tatashin@oracle.com
      
      
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dfb3ccd0
    • Kirill Tkhai's avatar
      mm/memcontrol.c: convert mem_cgroup_id::ref to refcount_t type · 1c2d479a
      Kirill Tkhai authored
      This allows us to use the generic refcount_t interfaces to check for
      counter overflow instead of the currently existing VM_BUG_ON().  The
      only difference after the patch is that VM_BUG_ON() may cause BUG(),
      while refcount_t fires a WARN().  But that does not seem significant
      here, since such problems are usually caught by syzbot running with
      panic-on-warn enabled.
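
      The conversion amounts to something like the sketch below, assuming the
      helper names in memcontrol.c stay as they are: the counter becomes a
      refcount_t, and the open-coded atomics with VM_BUG_ON() checks become
      refcount_add()/refcount_sub_and_test(), which warn on overflow and
      underflow by themselves.

          struct mem_cgroup_id {
                  int id;
                  refcount_t ref;         /* was: atomic_t ref */
          };

          static void mem_cgroup_id_get_many(struct mem_cgroup *memcg,
                                             unsigned int n)
          {
                  refcount_add(n, &memcg->id.ref);   /* WARNs on overflow */
          }

          static void mem_cgroup_id_put_many(struct mem_cgroup *memcg,
                                             unsigned int n)
          {
                  if (refcount_sub_and_test(n, &memcg->id.ref)) {
                          /* Last reference: recycle the ID, drop the css. */
                          css_put(&memcg->css);
                  }
          }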
      
      Link: http://lkml.kernel.org/r/153910718919.7006.13400779039257185427.stgit@localhost.localdomain
      
      
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c2d479a