Skip to content
  1. May 20, 2016
    • Hugh Dickins's avatar
      mm: /proc/sys/vm/stat_refresh to force vmstat update · 52b6f46b
      Hugh Dickins authored
      
      
      Provide /proc/sys/vm/stat_refresh to force an immediate update of
      per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
      before checking counts when testing.  Originally added to work around a
      bug which left counts stranded indefinitely on a cpu going idle (an
      inaccuracy magnified when small below-batch numbers represent "huge"
      amounts of memory), but I believe that bug is now fixed: nonetheless,
      this is still a useful knob.
      
      Its schedule_on_each_cpu() is probably too expensive just to fold into
      reading /proc/meminfo itself: give this mode 0600 to prevent abuse.
      Allow a write or a read to do the same: nothing to read, but "grep -h
      Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is convenient.  Oh, and
      since global_page_state() itself is careful to disguise any underflow as
      0, hack in an "Invalid argument" and pr_warn() if a counter is negative
      after the refresh - this helped to fix a misaccounting of
      NR_ISOLATED_FILE in my migration code.
      
      But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED
      often go negative some of the time.  I have not yet worked out why, but
      have no evidence that it's actually harmful.  Punt for the moment by
      just ignoring the anomaly on those.
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      52b6f46b
    • Andres Lagar-Cavilla's avatar
      tmpfs: mem_cgroup charge fault to vm_mm not current mm · 9e18eb29
      Andres Lagar-Cavilla authored
      
      
      Although shmem_fault() has been careful to count a major fault to vm_mm,
      shmem_getpage_gfp() has been careless in charging a remote access fault
      to current->mm owner's memcg instead of to vma->vm_mm owner's memcg:
      that is inconsistent with all the mem_cgroup charging on remote access
      faults in mm/memory.c.
      
      Fix it by passing fault_mm along with fault_type to
      shmem_get_page_gfp(); but in that case, now knowing the right mm, it's
      better for it to handle the PGMAJFAULT updates itself.
      
      And let's keep this clutter out of most callers' way: change the common
      shmem_getpage() wrapper to hide fault_mm and fault_type as well as gfp.
      
      Signed-off-by: default avatarAndres Lagar-Cavilla <andreslc@google.com>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9e18eb29
    • Hugh Dickins's avatar
      tmpfs: preliminary minor tidyups · 75edd345
      Hugh Dickins authored
      
      
      Make a few cleanups in mm/shmem.c, before going on to complicate it.
      
      shmem_alloc_page() will become more complicated: we can't afford to to
      have that complication duplicated between a CONFIG_NUMA version and a
      !CONFIG_NUMA version, so rearrange the #ifdef'ery there to yield a
      single shmem_swapin() and a single shmem_alloc_page().
      
      Yes, it's a shame to inflict the horrid pseudo-vma on non-NUMA
      configurations, but eliminating it is a larger cleanup: I have an
      alloc_pages_mpol() patchset not yet ready - mpol handling is subtle and
      bug-prone, and changed yet again since my last version.
      
      Move __SetPageLocked, __SetPageSwapBacked from shmem_getpage_gfp() to
      shmem_alloc_page(): that SwapBacked flag will be useful in future, to
      help to distinguish different cases appropriately.
      
      And the SGP_DIRTY variant of SGP_CACHE is hard to understand and of
      little use (IIRC it dates back to when shmem_getpage() returned the page
      unlocked): kill it and do the necessary in shmem_file_read_iter().
      
      But an arm64 build then complained that info may be uninitialized (where
      shmem_getpage_gfp() deletes a freshly alloced page beyond eof), and
      advancing to an "sgp <= SGP_CACHE" test jogged it back to reality.
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      75edd345
    • Hugh Dickins's avatar
      mm: use __SetPageSwapBacked and dont ClearPageSwapBacked · fa9949da
      Hugh Dickins authored
      v3.16 commit 07a42788
      
       ("mm: shmem: avoid atomic operation during
      shmem_getpage_gfp") rightly replaced one instance of SetPageSwapBacked
      by __SetPageSwapBacked, pointing out that the newly allocated page is
      not yet visible to other users (except speculative get_page_unless_zero-
      ers, who may not update page flags before their further checks).
      
      That was part of a series in which Mel was focused on tmpfs profiles:
      but almost all SetPageSwapBacked uses can be so optimized, with the same
      justification.
      
      Remove ClearPageSwapBacked from __read_swap_cache_async() error path:
      it's not an error to free a page with PG_swapbacked set.
      
      Follow a convention of __SetPageLocked, __SetPageSwapBacked instead of
      doing it differently in different places; but that's for tidiness - if
      the ordering actually mattered, we should not be using the __variants.
      
      There's probably scope for further __SetPageFlags in other places, but
      SwapBacked is the one I'm interested in at the moment.
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Reviewed-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fa9949da
    • Hugh Dickins's avatar
      mm: update_lru_size do the __mod_zone_page_state · 9d5e6a9f
      Hugh Dickins authored
      
      
      Konstantin Khlebnikov pointed out (nearly four years ago, when lumpy
      reclaim was removed) that lru_size can be updated by -nr_taken once per
      call to isolate_lru_pages(), instead of page by page.
      
      Update it inside isolate_lru_pages(), or at its two callsites? I chose
      to update it at the callsites, rearranging and grouping the updates by
      nr_taken and nr_scanned together in both.
      
      With one exception, mem_cgroup_update_lru_size(,lru,) is then used where
      __mod_zone_page_state(,NR_LRU_BASE+lru,) is used; and we shall be adding
      some more calls in a future commit.  Make the code a little smaller and
      simpler by incorporating stat update in lru_size update.
      
      The exception was move_active_pages_to_lru(), which aggregated the
      pgmoved stat update separately from the individual lru_size updates; but
      I still think this a simplification worth making.
      
      However, the __mod_zone_page_state is not peculiar to mem_cgroups: so
      better use the name update_lru_size, calls mem_cgroup_update_lru_size
      when CONFIG_MEMCG.
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9d5e6a9f
    • Hugh Dickins's avatar
      mm: update_lru_size warn and reset bad lru_size · ca707239
      Hugh Dickins authored
      
      
      Though debug kernels have a VM_BUG_ON to help protect from misaccounting
      lru_size, non-debug kernels are liable to wrap it around: and then the
      vast unsigned long size draws page reclaim into a loop of repeatedly
      doing nothing on an empty list, without even a cond_resched().
      
      That soft lockup looks confusingly like an over-busy reclaim scenario,
      with lots of contention on the lru_lock in shrink_inactive_list(): yet
      has a totally different origin.
      
      Help differentiate with a custom warning in
      mem_cgroup_update_lru_size(), even in non-debug kernels; and reset the
      size to avoid the lockup.  But the particular bug which suggested this
      change was mine alone, and since fixed.
      
      Make it a WARN_ONCE: the first occurrence is the most informative, a
      flurry may follow, yet even when rate-limited little more is learnt.
      
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ca707239
    • Konstantin Khlebnikov's avatar
    • Joonsoo Kim's avatar
      mm/vmstat: make node_page_state() handles all zones by itself · e87d59f7
      Joonsoo Kim authored
      
      
      node_page_state() manually adds statistics per each zone and returns
      total value for all zones.  Whenever we add a new zone, we need to
      consider this function and it's really troublesome.  Make it handle all
      zones by itself.
      
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e87d59f7
    • Joonsoo Kim's avatar
      mm/highmem: make nr_free_highpages() handles all highmem zones by itself · 33499bfe
      Joonsoo Kim authored
      
      
      nr_free_highpages() manually adds statistics per each highmem zone and
      returns a total value for them.  Whenever we add a new highmem zone, we
      need to consider this function and it's really troublesome.  Make it
      handle all highmem zones by itself.
      
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      33499bfe
    • Joonsoo Kim's avatar
      mm/page_alloc: correct highmem memory statistics · fc2bd799
      Joonsoo Kim authored
      
      
      ZONE_MOVABLE could be treated as highmem so we need to consider it for
      accurate statistics.  And, in following patches, ZONE_CMA will be
      introduced and it can be treated as highmem, too.  So, instead of
      manually adding stat of ZONE_MOVABLE, looping all zones and check
      whether the zone is highmem or not and add stat of the zone which can be
      treated as highmem.
      
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fc2bd799
    • Joonsoo Kim's avatar
      mm/writeback: correct dirty page calculation for highmem · 09b4ab3c
      Joonsoo Kim authored
      
      
      ZONE_MOVABLE could be treated as highmem so we need to consider it for
      accurate calculation of dirty pages.  And, in following patches,
      ZONE_CMA will be introduced and it can be treated as highmem, too.  So,
      instead of manually adding stat of ZONE_MOVABLE, looping all zones and
      check whether the zone is highmem or not and add stat of the zone which
      can be treated as highmem.
      
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: default avatarAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      09b4ab3c
    • Joonsoo Kim's avatar
      power: add zone range overlapping check · ba6b0979
      Joonsoo Kim authored
      
      
      There is a system thats node's pfns are overlapped as follows:
      
        -----pfn-------->
        N0 N1 N2 N0 N1 N2
      
      Therefore, we need to care this overlapping when iterating pfn range.
      
      mark_free_pages() iterates requested zone's pfn range and unset all
      range's bitmap first.  And then it marks freepages in a zone to the
      bitmap.  If there is an overlapping zone, above unset could clear
      previous marked bit and reference to this bitmap in the future will
      cause the problem.  To prevent it, this patch adds a zone check in
      mark_free_pages().
      
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ba6b0979
    • Joonsoo Kim's avatar
      mm/page_owner: add zone range overlapping check · 9d43f5ae
      Joonsoo Kim authored
      
      
      There is a system thats node's pfns are overlapped as follows:
      
        -----pfn-------->
        N0 N1 N2 N0 N1 N2
      
      Therefore, we need to care this overlapping when iterating pfn range.
      
      There are one place in page_owner.c that iterates pfn range and it
      doesn't consider this overlapping.  Add it.
      
      Without this patch, above system could over count early allocated page
      number before page_owner is activated.
      
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9d43f5ae
    • Joonsoo Kim's avatar
      mm/vmstat: add zone range overlapping check · a91c43c7
      Joonsoo Kim authored
      
      
      There is a system thats node's pfns are overlapped as follows:
      
        -----pfn-------->
        N0 N1 N2 N0 N1 N2
      
      Therefore, we need to care this overlapping when iterating pfn range.
      
      There are two places in vmstat.c that iterates pfn range and they don't
      consider this overlapping.  Add it.
      
      Without this patch, above system could over count pageblock number on a
      zone.
      
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a91c43c7
    • Joonsoo Kim's avatar
      mm/memory_hotplug: add comment to some functions related to memory hotplug · b9eb6319
      Joonsoo Kim authored
      
      
      __offline_isolated_pages() and test_pages_isolated() are used by memory
      hotplug.  These functions require that range is in a single zone but
      there is no code to do this because memory hotplug checks it before
      calling these functions.  To avoid confusing future user of these
      functions, this patch adds comments to them.
      
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b9eb6319
    • Joonsoo Kim's avatar
      mm/hugetlb: add same zone check in pfn_range_valid_gigantic() · f44b2dda
      Joonsoo Kim authored
      This patchset deals with some problematic sites that iterate pfn ranges.
      
      There is a system thats node's pfns are overlapped as follows:
      
        -----pfn-------->
        N0 N1 N2 N0 N1 N2
      
      Therefore, we need to take care of this overlapping when iterating pfn
      range.
      
      I audit many iterating sites that uses pfn_valid(), pfn_valid_within(),
      zone_start_pfn and etc.  and others looks safe to me.  This is a
      preparation step for a new CMA implementation, ZONE_CMA
      (https://lkml.org/lkml/2015/2/12/95
      
      ), because it would be easily
      overlapped with other zones.  But, zone overlap check is also needed for
      the general case so I send it separately.
      
      This patch (of 5):
      
      alloc_gigantic_page() uses alloc_contig_range() and this requires that
      the requested range is in a single zone.  To satisfy this requirement,
      add this check to pfn_range_valid_gigantic().
      
      Signed-off-by: default avatarJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f44b2dda
    • Andrew Morton's avatar
      mm: uninline page_mapped() · 1aa8aea5
      Andrew Morton authored
      
      
      It's huge.  Uninlining it saves 206 bytes per callsite.  Shaves 4924
      bytes from the x86_64 allmodconfig vmlinux.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1aa8aea5
    • Chanho Min's avatar
      mm/highmem: simplify is_highmem() · 29f9cb53
      Chanho Min authored
      
      
      is_highmem() can be simplified by use of is_highmem_idx().  This patch
      removes redundant code and will make it easier to maintain if the zone
      policy is changed or a new zone is added.
      
      (akpm: saves me 25 bytes of text per is_highmem() callsite)
      
      Signed-off-by: default avatarChanho Min <chanho.min@lge.com>
      Reviewed-by: default avatarDan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      29f9cb53
    • Vlastimil Babka's avatar
      mm, compaction: skip blocks where isolation fails in async direct compaction · fdd048e1
      Vlastimil Babka authored
      
      
      The goal of direct compaction is to quickly make a high-order page
      available for the pending allocation.  Within an aligned block of pages
      of desired order, a single allocated page that cannot be isolated for
      migration means that the block cannot fully merge to a buddy page that
      would satisfy the allocation request.  Therefore we can reduce the
      allocation stall by skipping the rest of the block immediately on
      isolation failure.  For async compaction, this also means a higher
      chance of succeeding until it detects contention.
      
      We however shouldn't completely sacrifice the second objective of
      compaction, which is to reduce overal long-term memory fragmentation.
      As a compromise, perform the eager skipping only in direct async
      compaction, while sync compaction (including kcompactd) remains
      thorough.
      
      Testing was done using stress-highalloc from mmtests, configured for
      order-4 GFP_KERNEL allocations:
      
                                       4.6-rc1               4.6-rc1
                                        before                 after
        Success 1 Min         24.00 (  0.00%)       27.00 (-12.50%)
        Success 1 Mean        30.20 (  0.00%)       31.60 ( -4.64%)
        Success 1 Max         37.00 (  0.00%)       35.00 (  5.41%)
        Success 2 Min         42.00 (  0.00%)       32.00 ( 23.81%)
        Success 2 Mean        44.00 (  0.00%)       44.80 ( -1.82%)
        Success 2 Max         48.00 (  0.00%)       52.00 ( -8.33%)
        Success 3 Min         91.00 (  0.00%)       92.00 ( -1.10%)
        Success 3 Mean        92.20 (  0.00%)       92.80 ( -0.65%)
        Success 3 Max         94.00 (  0.00%)       93.00 (  1.06%)
      
      We can see that success rates are unaffected by the skipping.
      
                      4.6-rc1     4.6-rc1
                       before       after
        User         2587.42     2566.53
        System        482.89      471.20
        Elapsed      1395.68     1382.00
      
      Times are not so useful metric for this benchmark as main portion is the
      interfering kernel builds, but results do hint at reduced system times.
      
                                            4.6-rc1     4.6-rc1
                                             before       after
        Direct pages scanned                163614      159608
        Kswapd pages scanned               2070139     2078790
        Kswapd pages reclaimed             2061707     2069757
        Direct pages reclaimed              163354      159505
      
      Reduced direct reclaim was unintended, but could be explained by more
      successful first attempt at (async) direct compaction, which is
      attempted before the first reclaim attempt in __alloc_pages_slowpath().
      
        Compaction stalls                    33052       39853
        Compaction success                   12121       19773
        Compaction failures                  20931       20079
      
      Compaction is indeed more successful, and thus less likely to get
      deferred, so there are also more direct compaction stalls.
      
        Page migrate success               3781876     3326819
        Page migrate failure                 45817       41774
        Compaction pages isolated          7868232     6941457
        Compaction migrate scanned       168160492   127269354
        Compaction migrate prescanned            0           0
        Compaction free scanned         2522142582  2326342620
        Compaction free direct alloc             0           0
        Compaction free dir. all. miss           0           0
        Compaction cost                       5252        4476
      
      The patch reduces migration scanned pages by 25% thanks to the eager
      skipping.
      
      [hughd@google.com: prevent nr_isolated_* from going negative]
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarHugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fdd048e1
    • Vlastimil Babka's avatar
      mm, compaction: reduce spurious pcplist drains · a34753d2
      Vlastimil Babka authored
      
      
      Compaction drains the local pcplists each time migration scanner moves
      away from a cc->order aligned block where it isolated pages for
      migration, so that the pages freed by migrations can merge into higher
      orders.
      
      The detection is currently coarser than it could be.  The
      cc->last_migrated_pfn variable should track the lowest pfn that was
      isolated for migration.  But it is set to the pfn where
      isolate_migratepages_block() starts scanning, which is typically the
      first pfn of the pageblock.  There, the scanner might fail to isolate
      several order-aligned blocks, and then isolate COMPACT_CLUSTER_MAX in
      another block.  This would cause the pcplists drain to be performed,
      although the scanner didn't yet finish the block where it isolated from.
      
      This patch thus makes cc->last_migrated_pfn handling more accurate by
      setting it to the pfn of an actually isolated page in
      isolate_migratepages_block().  Although practical effects of this patch
      are likely low, it arguably makes the intent of the code more obvious.
      Also the next patch will make async direct compaction skip blocks more
      aggressively, and draining pcplists due to skipped blocks is wasteful.
      
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a34753d2
    • Vlastimil Babka's avatar
      mm, compaction: wrap calculating first and last pfn of pageblock · 06b6640a
      Vlastimil Babka authored
      
      
      Compaction code has accumulated numerous instances of manual
      calculations of the first (inclusive) and last (exclusive) pfn of a
      pageblock (or a smaller block of given order), given a pfn within the
      pageblock.
      
      Wrap these calculations by introducing pageblock_start_pfn(pfn) and
      pageblock_end_pfn(pfn) macros.
      
      [vbabka@suse.cz: fix crash in get_pfnblock_flags_mask() from isolate_freepages():]
      Signed-off-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      06b6640a
    • Konstantin Khlebnikov's avatar
      mm/rmap: replace BUG_ON(anon_vma->degree) with VM_WARN_ON · e4c5800a
      Konstantin Khlebnikov authored
      
      
      This check effectively catches anon vma hierarchy inconsistence and some
      vma corruptions.  It was effective for catching corner cases in anon vma
      reusing logic.  For now this code seems stable so check could be hidden
      under CONFIG_DEBUG_VM and replaced with WARN because it's not so fatal.
      
      Signed-off-by: default avatarKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Suggested-by: default avatarVasily Averin <vvs@virtuozzo.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e4c5800a
    • Andrew Morton's avatar
      mm/mempolicy.c:offset_il_node() document and clarify · fee83b3a
      Andrew Morton authored
      
      
      This code was pretty obscure and was relying upon obscure side-effects
      of next_node(-1, ...) and was relying upon NUMA_NO_NODE being equal to
      -1.
      
      Clean that all up and document the function's intent.
      
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fee83b3a
    • Andrew Morton's avatar
      mm/hugetlb.c: use first_memory_node · 54f18d35
      Andrew Morton authored
      
      
      Instead of open-coding it.
      
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      54f18d35
    • Li Zhang's avatar
      mm/page_alloc: Remove useless parameter of __free_pages_boot_core · 949698a3
      Li Zhang authored
      
      
      __free_pages_boot_core has parameter pfn which is not used at all.
      Remove it.
      
      Signed-off-by: default avatarLi Zhang <zhlcindy@linux.vnet.ibm.com>
      Reviewed-by: default avatarPan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      949698a3
    • Michal Hocko's avatar
      mm/memcontrol.c:mem_cgroup_select_victim_node(): clarify comment · fda3d69b
      Michal Hocko authored
      
      
      > The comment seems to have not much to do with the code?
      
      I guess the comment tries to say that the code path is triggered when we
      charge the page which happens _before_ it is added to the LRU list and
      so last_scanned_node might contain the stale data.
      
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fda3d69b
    • Yaowei Bai's avatar
      mm/mempolicy.c: vma_migratable() can return bool · 4ee815be
      Yaowei Bai authored
      
      
      Make vma_migratable() return bool due to this particular function only
      using either one or zero as its return value.
      
      Signed-off-by: default avatarYaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4ee815be
    • Yaowei Bai's avatar
      mm/vmalloc.c: is_vmalloc_addr() can return bool · bb00a789
      Yaowei Bai authored
      
      
      Make is_vmalloc_addr() return bool to improve readability due to this
      particular function only using either one or zero as its return value.
      
      Signed-off-by: default avatarYaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      bb00a789
    • Yaowei Bai's avatar
      mm/memory_hotplug: is_mem_section_removable() can return bool · c98940f6
      Yaowei Bai authored
      
      
      Make is_mem_section_removable() return bool to improve readability due
      to this particular function only using either one or zero as its return
      value.
      
      Signed-off-by: default avatarYaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c98940f6
    • Yaowei Bai's avatar
      mm/hugetlb: is_vm_hugetlb_page() can return bool · 32f6271d
      Yaowei Bai authored
      
      
      Make is_vm_hugetlb_page() return bool to improve readability due to this
      particular function only using either one or zero as its return value.
      
      Signed-off-by: default avatarYaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      32f6271d
    • Vaishali Thakkar's avatar
      x86: mm: use hugetlb_bad_size() · 2b18e532
      Vaishali Thakkar authored
      
      
      Update setup_hugepagesz() to call hugetlb_bad_size() when unsupported
      hugepage size is found.
      
      Signed-off-by: default avatarVaishali Thakkar <vaishali.thakkar@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2b18e532
    • Vaishali Thakkar's avatar
      tile: mm: use hugetlb_bad_size() · b3d424f1
      Vaishali Thakkar authored
      
      
      Update setup_hugepagesz() to call hugetlb_bad_size() when unsupported
      hugepage size is found.
      
      Signed-off-by: default avatarVaishali Thakkar <vaishali.thakkar@oracle.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b3d424f1
    • Vaishali Thakkar's avatar
      powerpc: mm: use hugetlb_bad_size() · 71bf79cc
      Vaishali Thakkar authored
      
      
      Update setup_hugepagesz() to call hugetlb_bad_size() when unsupported
      hugepage size is found.
      
      Signed-off-by: default avatarVaishali Thakkar <vaishali.thakkar@oracle.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      71bf79cc
    • Vaishali Thakkar's avatar
      metag: mm: use hugetlb_bad_size() · 9cc3387f
      Vaishali Thakkar authored
      
      
      Update setup_hugepagesz() to call hugetlb_bad_size() when unsupported
      hugepage size is found.
      
      Signed-off-by: default avatarVaishali Thakkar <vaishali.thakkar@oracle.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9cc3387f
    • Vaishali Thakkar's avatar
      arm64: mm: use hugetlb_bad_size() · d77e20ce
      Vaishali Thakkar authored
      
      
      Update setup_hugepagesz() to call hugetlb_bad_size() when unsupported
      hugepage size is found.
      
      Signed-off-by: default avatarVaishali Thakkar <vaishali.thakkar@oracle.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d77e20ce
    • Vaishali Thakkar's avatar
      mm/hugetlb: introduce hugetlb_bad_size() · 9fee021d
      Vaishali Thakkar authored
      
      
      When any unsupported hugepage size is specified, 'hugepagesz=' and
      'hugepages=' should be ignored during command line parsing until any
      supported hugepage size is found.  But currently incorrect number of
      hugepages are allocated when unsupported size is specified as it fails
      to ignore the 'hugepages=' command.
      
      Test case:
      
      Note that this is specific to x86 architecture.
      
      Boot the kernel with command line option 'hugepagesz=256M hugepages=X'.
      After boot, dmesg output shows that X number of hugepages of the size 2M
      is pre-allocated instead of 0.
      
      So, to handle such command line options, introduce new routine
      hugetlb_bad_size.  The routine hugetlb_bad_size sets the global variable
      parsed_valid_hugepagesz.  We are using parsed_valid_hugepagesz to save
      the state when unsupported hugepagesize is found so that we can ignore
      the 'hugepages=' parameters after that and then reset the variable when
      supported hugepage size is found.
      
      The routine hugetlb_bad_size can be called while setting 'hugepagesz='
      parameter in an architecture specific code.
      
      Signed-off-by: default avatarVaishali Thakkar <vaishali.thakkar@oracle.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9fee021d
    • Mike Kravetz's avatar
      mm/hugetlb: optimize minimum size (min_size) accounting · 09a95e29
      Mike Kravetz authored
      
      
      It was observed that minimum size accounting associated with the
      hugetlbfs min_size mount option may not perform optimally and as
      expected.  As huge pages/reservations are released from the filesystem
      and given back to the global pools, they are reserved for subsequent
      filesystem use as long as the subpool reserved count is less than
      subpool minimum size.  It does not take into account used pages within
      the filesystem.  The filesystem size limits are not exceeded and this is
      technically not a bug.  However, better behavior would be to wait for
      the number of used pages/reservations associated with the filesystem to
      drop below the minimum size before taking reservations to satisfy
      minimum size.
      
      An optimization is also made to the hugepage_subpool_get_pages() routine
      which is called when pages/reservations are allocated.  This does not
      change behavior, but simply avoids the accounting if all reservations
      have already been taken (subpool reserved count == 0).
      
      Signed-off-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: default avatarNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      09a95e29
    • Andrew Morton's avatar
      include/linux/nodemask.h: create next_node_in() helper · 0edaf86c
      Andrew Morton authored
      
      
      Lots of code does
      
      	node = next_node(node, XXX);
      	if (node == MAX_NUMNODES)
      		node = first_node(XXX);
      
      so create next_node_in() to do this and use it in various places.
      
      [mhocko@suse.com: use next_node_in() helper]
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarMichal Hocko <mhocko@kernel.org>
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Hui Zhu <zhuhui@xiaomi.com>
      Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0edaf86c
    • Rasmus Villemoes's avatar
      include/linux: apply __malloc attribute · 48a27055
      Rasmus Villemoes authored
      
      
      Attach the malloc attribute to a few allocation functions.  This helps
      gcc generate better code by telling it that the return value doesn't
      alias any existing pointers (which is even more valuable given the
      pessimizations implied by -fno-strict-aliasing).
      
      A simple example of what this allows gcc to do can be seen by looking at
      the last part of drm_atomic_helper_plane_reset:
      
      	plane->state = kzalloc(sizeof(*plane->state), GFP_KERNEL);
      
      	if (plane->state) {
      		plane->state->plane = plane;
      		plane->state->rotation = BIT(DRM_ROTATE_0);
      	}
      
      which compiles to
      
          e8 99 bf d6 ff          callq  ffffffff8116d540 <kmem_cache_alloc_trace>
          48 85 c0                test   %rax,%rax
          48 89 83 40 02 00 00    mov    %rax,0x240(%rbx)
          74 11                   je     ffffffff814015c4 <drm_atomic_helper_plane_reset+0x64>
          48 89 18                mov    %rbx,(%rax)
          48 8b 83 40 02 00 00    mov    0x240(%rbx),%rax [*]
          c7 40 40 01 00 00 00    movl   $0x1,0x40(%rax)
      
      With this patch applied, the instruction at [*] is elided, since the
      store to plane->state->plane is known to not alter the value of
      plane->state.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      48a27055
    • Rasmus Villemoes's avatar
      compiler.h: add support for malloc attribute · d64e85d3
      Rasmus Villemoes authored
      gcc as far back as at least 3.04 documents the function attribute
      __malloc__.  Add a shorthand for attaching that to a function
      declaration.  This was also suggested by Andi Kleen way back in 2002
      [1], but didn't get applied, perhaps because gcc at that time generated
      the exact same code with and without this attribute.
      
      This attribute tells the compiler that the return value (if non-NULL)
      can be assumed not to alias any other valid pointers at the time of the
      call.
      
      Please note that the documentation for a range of gcc versions (starting
      from around 4.7) contained a somewhat confusing and self-contradicting
      text:
      
        The malloc attribute is used to tell the compiler that a function may
        be treated as if any non-NULL pointer it returns cannot alias any other
        pointer valid when the function returns and *that the memory has
        undefined content*.  [...] Standard functions with this property include
        malloc and *calloc*.
      
      (emphasis mine). The intended meaning has later been clarified [2]:
      
        This tells the compiler that a function is malloc-like, i.e., that the
        pointer P returned by the function cannot alias any other pointer valid
        when the function returns, and moreover no pointers to valid objects
        occur in any storage addressed by P.
      
      What this means is that we can apply the attribute to kmalloc and
      friends, and it is ok for the returned memory to have well-defined
      contents (__GFP_ZERO).  But it is not ok to apply it to kmemdup(), nor
      to other functions which both allocate and possibly initialize the
      memory with existing pointers.  So unless someone is doing something
      pretty perverted kstrdup() should also be a fine candidate.
      
      [1] http://thread.gmane.org/gmane.linux.kernel/57172
      [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56955
      
      
      
      Signed-off-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d64e85d3