  1. Feb 01, 2018
    • mm: memcontrol: fix excessive complexity in memory.stat reporting · a983b5eb
      Johannes Weiner authored
      
      
      We've seen memory.stat reads in top-level cgroups take up to fourteen
      seconds during a userspace bug that created tens of thousands of ghost
      cgroups pinned by lingering page cache.
      
      Even with a more reasonable number of cgroups, aggregating memory.stat
      is unnecessarily heavy.  The complexity is this:
      
      	nr_cgroups * nr_stat_items * nr_possible_cpus
      
      where the stat items are ~70 at this point.  With 128 cgroups and 128
      CPUs - decent, not enormous setups - reading the top-level memory.stat
      has to aggregate over a million per-cpu counters.  This doesn't scale.
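The cost model above can be written out directly as a sanity check (a sketch, not kernel code; the function name is illustrative):

```c
/* Cost of one memory.stat read under the old scheme: every cgroup's
 * stat items must be summed across every possible CPU. */
static unsigned long stat_read_cost(unsigned long nr_cgroups,
                                    unsigned long nr_stat_items,
                                    unsigned long nr_possible_cpus)
{
        return nr_cgroups * nr_stat_items * nr_possible_cpus;
}
```

With 128 cgroups, ~70 stat items, and 128 CPUs, this comes to 1,146,880 per-cpu counters per read, matching the "over a million" figure above.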
      
      Instead of spreading the source of truth across all CPUs, use the
      per-cpu counters merely to batch updates to shared atomic counters.
      
      This is the same as the per-cpu stocks we use for charging memory to the
      shared atomic page_counters, and also the way the global vmstat counters
      are implemented.
      
      Vmstat has elaborate spilling thresholds that depend on the number of
      CPUs, amount of memory, and memory pressure - carefully balancing the
      cost of counter updates with the amount of per-cpu error.  That's
      because the vmstat counters are system-wide, but also used for decisions
      inside the kernel (e.g.  NR_FREE_PAGES in the allocator).  Neither is
      true for the memory controller.
      
      Use the same static batch size we already use for page_counter updates
      during charging.  The per-cpu error in the stats will be 128k, which is
      an acceptable ratio of cores to memory accounting granularity.
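The batching scheme can be sketched in plain C11 (a simplification: the real code keeps one delta per CPU and uses the kernel's percpu and atomic primitives; a batch of 32 pages corresponds to the 128k of per-cpu error mentioned above on 4K pages):

```c
#include <stdatomic.h>

#define CHARGE_BATCH 32 /* pages; 32 * 4KB = 128KB of per-cpu error */

struct batched_counter {
        atomic_long shared;     /* the single shared source of truth */
        long delta;             /* one of these per CPU in the real code */
};

/* Accumulate updates locally and spill into the shared atomic only
 * once the local delta exceeds the batch size in either direction. */
static void counter_mod(struct batched_counter *c, long val)
{
        long x = c->delta + val;

        if (x > CHARGE_BATCH || x < -CHARGE_BATCH) {
                atomic_fetch_add(&c->shared, x);
                x = 0;
        }
        c->delta = x;
}
```

Readers then consult only the shared atomic, so a stat read no longer scales with the number of CPUs.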
      
      [hannes@cmpxchg.org: fix warning in __this_cpu_xchg() calls]
        Link: http://lkml.kernel.org/r/20171201135750.GB8097@cmpxchg.org
      Link: http://lkml.kernel.org/r/20171103153336.24044-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: implement lruvec stat functions on top of each other · 28454265
      Johannes Weiner authored
      
      
      The implementations of the lruvec stat functions and their variants, for
      accounting through a page or from a preemptible context, are mostly
      identical and needlessly repetitive.
      
      Implement the lruvec_page functions by looking up the page's lruvec and
      then using the lruvec function.
      
      Implement the functions for preemptible contexts by disabling preemption
      before calling the atomic context functions.
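The layering described above can be sketched with simplified stand-ins (illustrative types and names, not the kernel's actual signatures):

```c
/* Demo stand-ins for kernel preemption primitives. */
static int preempt_count_demo;
static void preempt_disable(void) { preempt_count_demo++; }
static void preempt_enable(void)  { preempt_count_demo--; }

struct lruvec { long stat[4]; };

/* Atomic-context primitive: the one real implementation. */
static void __mod_lruvec_state(struct lruvec *lruvec, int idx, int val)
{
        lruvec->stat[idx] += val;
}

/* Preemptible variant: merely wraps the atomic-context function
 * in preempt_disable()/preempt_enable(). */
static void mod_lruvec_state(struct lruvec *lruvec, int idx, int val)
{
        preempt_disable();
        __mod_lruvec_state(lruvec, idx, val);
        preempt_enable();
}
```

The page-based variants follow the same pattern: look up the page's lruvec, then call the lruvec function.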
      
      Link: http://lkml.kernel.org/r/20171103153336.24044-2-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: eliminate raw access to stat and event counters · c9019e9b
      Johannes Weiner authored
      
      
      Replace all raw 'this_cpu_' modifications of the stat and event per-cpu
      counters with API functions such as mod_memcg_state().
      
      This makes the code easier to read, but is also in preparation for the
      next patch, which changes the per-cpu implementation of those counters.
      
      Link: http://lkml.kernel.org/r/20171103153336.24044-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/filemap.c: remove include of hardirq.h · 2b9fceb3
      Yang Shi authored
      
      
      in_atomic() has been moved to include/linux/preempt.h, and filemap.c
      does not use in_atomic() directly at all, so there is no need to
      include hardirq.h.
      
      Link: http://lkml.kernel.org/r/1509985319-38633-1-git-send-email-yang.s@alibaba-inc.com
      Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
      Reviewed-by: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: split deferred_init_range into initializing and freeing parts · 80b1f41c
      Pavel Tatashin authored
      
      
      In deferred_init_range() we initialize struct pages and also free them
      to the buddy allocator.  We do it in separate loops, because the buddy
      page is computed ahead of time, and we do not want to access a struct
      page that has not been initialized yet.
      
      There is still, however, a corner case where it is potentially possible
      to access an uninitialized struct page: when the buddy page is in the
      next memblock range.
      
      This patch fixes this problem by splitting deferred_init_range() into
      two functions: one to initialize struct pages, and another to free them.
      
      In addition, this patch brings the following improvements:
       - Gets rid of the __def_free() helper function and simplifies the loop
         logic by adding a new pfn validity check function, deferred_pfn_valid().
       - Reduces the number of variables that we track, so there is a higher
         chance that we will avoid using the stack to store/load variables
         inside hot loops.
       - Enables future multi-threading of these functions: do initialization
         in multiple threads, wait for all threads to finish, then do the
         freeing part in multiple threads.
      
      Tested on x86 with 1T of memory to make sure no regressions are
      introduced.
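The two-pass structure can be sketched as follows (a toy model with plain arrays standing in for struct pages and the buddy allocator; names are illustrative, not the kernel's):

```c
#define NR_PAGES 16

static int initialized[NR_PAGES];
static int freed[NR_PAGES];

/* Pass 1: initialize every struct page in the pfn range. */
static unsigned long deferred_init_pages_demo(unsigned long start,
                                              unsigned long end)
{
        unsigned long nr = 0;

        for (unsigned long pfn = start; pfn < end; pfn++, nr++)
                initialized[pfn] = 1;
        return nr;
}

/* Pass 2: free the range to the buddy allocator.  Because pass 1 has
 * already completed, any buddy page this pass looks at (even one from
 * the next memblock range) is guaranteed to be initialized. */
static unsigned long deferred_free_pages_demo(unsigned long start,
                                              unsigned long end)
{
        unsigned long nr = 0;

        for (unsigned long pfn = start; pfn < end; pfn++, nr++)
                freed[pfn] = initialized[pfn];
        return nr;
}
```

Splitting the passes is also what makes the future multi-threading plan possible: run pass 1 in parallel, join, then run pass 2.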
      
      [akpm@linux-foundation.org: fix spello in comment]
      Link: http://lkml.kernel.org/r/20171107150446.32055-2-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use sc->priority for slab shrink targets · 9092c71b
      Josef Bacik authored
      
      
      Previously we were using the ratio of the number of LRU pages scanned to
      the number of eligible LRU pages to determine the number of slab objects
      to scan.  The problem is that these two things have nothing to do with
      each other, so in slab-heavy workloads with little to no page cache we
      can end up with the pages-scanned count being a very low number.  This
      means that we reclaim next to no slab pages and waste a lot of time
      reclaiming small amounts of space.
      
      Consider the following scenario, where we have the following values and
      the rest of the memory usage is in slab
      
        Active:            58840 kB
        Inactive:          46860 kB
      
      Every time we do a get_scan_count() we do this
      
        scan = size >> sc->priority
      
      where sc->priority starts at DEF_PRIORITY, which is 12.  The first loop
      through reclaim would result in a scan target of 2 pages out of 11715
      total inactive pages, and 3 pages out of 14710 total active pages.
      This is a really small target for a system that is entirely slab pages.
      And this is optimistic; it assumes we even get to scan these pages.  We
      don't increment sc->nr_scanned unless we 1) isolate the page, which
      assumes it's not in use, and 2) can lock the page.  Under pressure these
      numbers could probably go down further; there are surely some random
      pages from daemons that aren't actually in use, so the targets get even
      smaller.
      
      Instead use sc->priority in the same way we use it to determine scan
      amounts for the lru's.  This generally equates to pages.  Consider the
      following
      
        slab_pages = (nr_objects * object_size) / PAGE_SIZE
      
      What we would like to do is
      
        scan = slab_pages >> sc->priority
      
      but we don't know the number of slab pages each shrinker controls, only
      the objects.  However say that theoretically we knew how many pages a
      shrinker controlled, we'd still have to convert this to objects, which
      would look like the following
      
        scan = shrinker_pages >> sc->priority
        scan_objects = (PAGE_SIZE / object_size) * scan
      
      or written another way
      
        scan_objects = (shrinker_pages >> sc->priority) *
      		 (PAGE_SIZE / object_size)
      
      which can thus be written
      
        scan_objects = ((shrinker_pages * PAGE_SIZE) / object_size) >>
      		 sc->priority
      
      which is just
      
        scan_objects = nr_objects >> sc->priority
      
      We don't need to know exactly how many pages each shrinker represents;
      its object count is all the information we need.  Making this change
      allows us to place an appropriate amount of pressure on the shrinker
      pools relative to their size.
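The end result of the derivation is a one-line scaling, the same shape already used for the LRU scan targets (a sketch; the function name is illustrative):

```c
/* Scan target: object (or page) count scaled by reclaim priority.
 * DEF_PRIORITY is 12, so the first reclaim pass asks for 1/4096th
 * of the pool, and each subsequent pass doubles the pressure. */
static unsigned long scan_target(unsigned long nr_objects, int priority)
{
        return nr_objects >> priority;
}
```

With the numbers from the example above, 11715 inactive pages at priority 12 yields a target of 2 and 14710 active pages yields 3, and the same formula now applies uniformly to shrinker object counts.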
      
      Link: http://lkml.kernel.org/r/1510780549-6812-1-git-send-email-josef@toxicpanda.com
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Dave Chinner <david@fromorbit.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: show total hugetlb memory consumption in /proc/meminfo · fcb2b0c5
      Roman Gushchin authored
      
      
      Currently we display some hugepage statistics (total, free, etc.) in
      /proc/meminfo, but only for the default hugepage size (e.g.  2MB).
      
      If hugepages of different sizes are used (like 2MB and 1GB on x86-64),
      /proc/meminfo output can be confusing, as non-default sized hugepages
      are not reflected at all, and there is no sign that they exist and
      consume system memory.
      
      To solve this problem, let's display the total amount of memory
      consumed by hugetlb pages of all sizes (both free and used).  Let's
      call it "Hugetlb" and display the size in kB to match the generic
      /proc/meminfo style.
      
      For example (1024 2MB pages and 2 1GB pages are pre-allocated):
        $ cat /proc/meminfo
        MemTotal:        8168984 kB
        MemFree:         3789276 kB
        <...>
        CmaFree:               0 kB
        HugePages_Total:    1024
        HugePages_Free:     1024
        HugePages_Rsvd:        0
        HugePages_Surp:        0
        Hugepagesize:       2048 kB
        Hugetlb:         4194304 kB
        DirectMap4k:       32632 kB
        DirectMap2M:     4161536 kB
        DirectMap1G:     6291456 kB
      
      Also, this patch updates the corresponding docs to reflect the meaning
      of the Hugetlb entry and the difference between Hugetlb and
      HugePages_Total * Hugepagesize.
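The arithmetic behind the new field can be checked against the example above (a sketch; the kernel iterates its real hstates rather than a demo array):

```c
/* One entry per supported hugepage size; the real kernel walks its
 * hstate list.  Demo type and names are illustrative. */
struct hstate_demo {
        unsigned long nr_huge_pages;    /* free + used */
        unsigned long size_kb;          /* page size in kB */
};

/* "Hugetlb" in /proc/meminfo: total memory held by hugepages of
 * every size, not just the default one. */
static unsigned long hugetlb_total_kb(const struct hstate_demo *h, int n)
{
        unsigned long total = 0;

        for (int i = 0; i < n; i++)
                total += h[i].nr_huge_pages * h[i].size_kb;
        return total;
}
```

For the pre-allocated example, 1024 x 2048 kB plus 2 x 1048576 kB gives 4194304 kB, the Hugetlb line shown above.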
      
      Link: http://lkml.kernel.org/r/20171115231409.12131-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: drop hotplug lock from lru_add_drain_all() · 9852a721
      Michal Hocko authored
      Pulling cpu hotplug locks inside an mm core function like
      lru_add_drain_all() just asks for problems, and the recent lockdep
      splat [1] proves this.  While the usage in that particular case might
      be wrong, we should avoid the locking, as lru_add_drain_all() is used
      in many places.  It turns out this is not all that hard to achieve.
      
      We have done the same thing for drain_all_pages, which is analogous, in
      commit a459eeb7 ("mm, page_alloc: do not depend on cpu hotplug locks
      inside the allocator").  All we have to care about is to handle:
      
            - the work item might be executed on a different cpu by a worker
              from an unbound pool, so it doesn't run pinned on the cpu it is
              draining
      
            - we have to make sure that we do not race with page_alloc_cpu_dead
              calling lru_add_drain_cpu
      
      The first part is already handled because the worker calls
      lru_add_drain, which disables preemption when calling lru_add_drain_cpu
      on the local cpu it is draining.  The latter is true because
      page_alloc_cpu_dead is called on the controlling CPU after the
      hotplugged CPU has vanished completely.
      
      [1] http://lkml.kernel.org/r/089e0825eec8955c1f055c83d476@google.com
      
      [add a cpu hotplug locking interaction as per tglx]
      Link: http://lkml.kernel.org/r/20171116120535.23765-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mempolicy: add nodes_empty check in SYSC_migrate_pages · 0486a38b
      Yisheng Xie authored
      
      
      As the manpage of migrate_pages states, errno should be set to EINVAL
      when none of the node IDs specified by new_nodes are on-line and
      allowed by the process's current cpuset context, or none of the
      specified nodes contain memory.  However, when tested with the
      following case:
      
      	new_nodes = 0;
      	old_nodes = 0xf;
      	ret = migrate_pages(pid, old_nodes, new_nodes, MAX);
      
      ret is 0 and no errno is set.  As new_nodes is empty, we should expect
      EINVAL as documented.
      
      To fix cases like the above, this patch checks whether the AND of the
      target nodes and the current task_nodes is empty, and then whether the
      AND with node_states[N_MEMORY] is empty.
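The two emptiness checks can be sketched with a single unsigned long standing in for the kernel's nodemask_t (demo names are illustrative):

```c
#include <errno.h>

/* Fail with EINVAL when the requested target nodes, intersected with
 * the caller's allowed nodes, or further with the set of nodes that
 * have memory, come out empty. */
static int check_target_nodes(unsigned long new_nodes,
                              unsigned long task_nodes,
                              unsigned long memory_nodes)
{
        unsigned long allowed = new_nodes & task_nodes;

        if (!allowed || !(allowed & memory_nodes))
                return -EINVAL;
        return 0;
}
```

The failing test case above (new_nodes = 0) trips the first condition, so the syscall now returns EINVAL as the manpage requires.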
      
      Link: http://lkml.kernel.org/r/1510882624-44342-4-git-send-email-xieyisheng1@huawei.com
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Chris Salls <salls@cs.ucsb.edu>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Tan Xiaojun <tanxiaojun@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mempolicy: fix the check of nodemask from user · 56521e7a
      Yisheng Xie authored
      
      
      As Xiaojun reported the ltp of migrate_pages01 will fail on arm64 system
      which has 4 nodes[0...3], all have memory and CONFIG_NODES_SHIFT=2:
      
        migrate_pages01    0  TINFO  :  test_invalid_nodes
        migrate_pages01   14  TFAIL  :  migrate_pages_common.c:45: unexpected failure - returned value = 0, expected: -1
        migrate_pages01   15  TFAIL  :  migrate_pages_common.c:55: call succeeded unexpectedly
      
      In this case test_invalid_nodes of migrate_pages01 calls
      SYSC_migrate_pages as:
      
        migrate_pages(0, , {0x0000000000000001}, 64, , {0x0000000000000010}, 64) = 0
      
      The new nodes specify one or more node IDs that are greater than the
      maximum supported node ID; however, errno is not set to EINVAL as
      expected.
      
      As the man pages of set_mempolicy[1], mbind[2], and migrate_pages[3]
      mention, when the nodemask specifies one or more node IDs greater than
      the maximum supported node ID, errno should be set to EINVAL.  However,
      get_nodes only checks whether the bits in
      [BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES), maxnode) are zero, and
      leaves [MAX_NUMNODES, BITS_PER_LONG*BITS_TO_LONGS(MAX_NUMNODES))
      unchecked.
      
      This patch checks the bits in [MAX_NUMNODES, maxnode) in get_nodes, so
      migrate_pages sets errno to EINVAL when the nodemask specifies one or
      more node IDs greater than the maximum supported node ID, as the man
      pages document.
      
      [1] http://man7.org/linux/man-pages/man2/set_mempolicy.2.html
      [2] http://man7.org/linux/man-pages/man2/mbind.2.html
      [3] http://man7.org/linux/man-pages/man2/migrate_pages.2.html
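The fixed range check amounts to rejecting any set bit at or above MAX_NUMNODES, not only bits beyond the last full word of the mask.  A sketch with a deliberately small demo value (the real MAX_NUMNODES is a config-dependent kernel constant):

```c
#define DEMO_MAX_NUMNODES 4     /* illustrative; CONFIG_NODES_SHIFT=2 */

/* A user nodemask is valid only if no bit in
 * [MAX_NUMNODES, BITS_PER_LONG) is set. */
static int nodemask_valid(unsigned long mask)
{
        return (mask >> DEMO_MAX_NUMNODES) == 0;
}
```

With 4 nodes, the LTP case's new-nodes mask {0x10} has bit 4 set and is now rejected with EINVAL, while {0x1} remains valid.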
      
      Link: http://lkml.kernel.org/r/1510882624-44342-3-git-send-email-xieyisheng1@huawei.com
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Reported-by: Tan Xiaojun <tanxiaojun@huawei.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Chris Salls <salls@cs.ucsb.edu>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mempolicy: remove redundant check in get_nodes · 66f308ed
      Yisheng Xie authored
      
      
      We have already checked whether maxnode is more than a page worth of
      bits, via:
          maxnode > PAGE_SIZE*BITS_PER_BYTE
      
      So there is no need to check it once more.
      
      Link: http://lkml.kernel.org/r/1510882624-44342-2-git-send-email-xieyisheng1@huawei.com
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Chris Salls <salls@cs.ucsb.edu>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Tan Xiaojun <tanxiaojun@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: relax deferred struct page requirements · 2e3ca40f
      Pavel Tatashin authored
      
      
      There is no need for ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT, as all
      the page initialization code is in common code.
      
      Also, there is no need to depend on MEMORY_HOTPLUG, as the
      initialization code does not really use hotplug memory functionality.
      So, we can remove this requirement as well.
      
      This patch allows using deferred struct page initialization on all
      platforms with the memblock allocator.
      
      Tested on x86, arm64, and sparc.  Also verified that the code compiles
      on PPC with CONFIG_MEMORY_HOTPLUG disabled.
      
      Link: http://lkml.kernel.org/r/20171117014601.31606-1-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>	[s390]
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • zswap: same-filled pages handling · a85f878b
      Srividya Desireddy authored
      
      
      Zswap is a cache which compresses the pages that are being swapped out
      and stores them into a dynamically allocated RAM-based memory pool.
      Experiments have shown that around 10-20% of pages stored in zswap are
      same-filled pages (i.e.  the contents of the page are all the same
      value), but these pages are handled like normal pages, by compressing
      them and allocating memory in the pool.
      
      This patch adds a check in zswap_frontswap_store() to identify a
      same-filled page before compressing it.  If the page is same-filled,
      set zswap_entry.length to zero, save the fill value, and skip
      compressing the page and allocating memory in the zpool.  In
      zswap_frontswap_load(), check whether zswap_entry.length is zero for
      the page to be loaded; if so, fill the page with the saved value.  This
      saves the decompression time during load.
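The same-filled test can be sketched as a word-by-word comparison against the first word of the page (a simplification with a shrunken page size; the actual kernel helper and names differ):

```c
#include <stddef.h>

#define DEMO_PAGE_WORDS 8       /* a real page is PAGE_SIZE/sizeof(long) */

/* Returns 1 and stores the fill value if every word of the page equals
 * the first word; returns 0 as soon as any word differs. */
static int page_same_filled(const unsigned long *page, unsigned long *value)
{
        unsigned long first = page[0];

        for (size_t i = 1; i < DEMO_PAGE_WORDS; i++)
                if (page[i] != first)
                        return 0;
        *value = first;
        return 1;
}
```

On store, a hit means only the single fill value is kept (zswap_entry.length = 0); on load, the page is reconstructed by refilling it with that value, e.g. via memset_l, with no decompression.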
      
      On an ARM quad-core 32-bit device with 1.5GB RAM, launching and
      relaunching different applications, out of ~64000 pages stored in
      zswap, ~11000 pages were same-value filled (including zero-filled
      pages) and ~9000 pages were zero-filled.
      
      On average, 17% of the pages (including zero-filled pages) in zswap are
      same-value filled, 14% are zero-filled, and 3% are same-filled non-zero
      pages.
      
      The below table shows the execution time profiling with the patch.
      
                                  Baseline    With patch  % Improvement
        -----------------------------------------------------------------
        *Zswap Store Time           26.5ms       18ms          32%
         (of same value pages)
        *Zswap Load Time
         (of same value pages)      25.5ms       13ms          49%
        -----------------------------------------------------------------
      
      On an Ubuntu PC with 2GB RAM, while executing kernel builds, other test
      scripts, and multimedia applications, out of 360000 pages stored in
      zswap, 78000 (~22%) were found to be same-value filled pages (including
      zero-filled pages) and 64000 (~17%) were zero-filled pages.  So an
      average of 5% of pages are same-filled non-zero pages.
      
      The below table shows the execution time profiling with the patch.
      
                                  Baseline    With patch  % Improvement
        -----------------------------------------------------------------
        *Zswap Store Time           91ms        74ms           19%
         (of same value pages)
        *Zswap Load Time            50ms        7.5ms          85%
         (of same value pages)
        -----------------------------------------------------------------
      
      *The execution times may vary with test device used.
      
      Dan said:
      
      : I did test this patch out this week, and I added some instrumentation to
      : check the performance impact, and tested with a small program to try to
      : check the best and worst cases.
      :
      : When doing a lot of swap where all (or almost all) pages are same-value, I
      : found this patch does save both time and space, significantly.  The exact
      : improvement in time and space depends on which compressor is being used,
      : but roughly agrees with the numbers you listed.
      :
      : In the worst case situation, where all (or almost all) pages have the
      : same-value *except* the final long (meaning, zswap will check each long on
      : the entire page but then still have to pass the page to the compressor),
      : the same-value check is around 10-15% of the total time spent in
      : zswap_frontswap_store().  That's a not-insignificant amount of time, but
      : it's not huge.  Considering that most systems will probably be swapping
      : pages that aren't similar to the worst case (although I don't have any
      : data to know that), I'd say the improvement is worth the possible
      : worst-case performance impact.
      
      [srividya.dr@samsung.com: add memset_l instead of for loop]
      Link: http://lkml.kernel.org/r/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1
      Signed-off-by: Srividya Desireddy <srividya.dr@samsung.com>
      Acked-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Dinakar Reddy Pathireddy <dinakar.p@samsung.com>
      Cc: SHARAN ALLUR <sharan.allur@samsung.com>
      Cc: RAJIB BASU <rajib.basu@samsung.com>
      Cc: JUHUN KIM <juhunkim@samsung.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Timofey Titovets <nefelim4ag@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: kmemleak: remove unused hardirq.h · 4a01768e
      Yang Shi authored
      
      
      The preempt counter APIs have been split out; currently hardirq.h just
      includes the irq_enter/exit APIs, which are not used by kmemleak at
      all.
      
      So, remove the unused hardirq.h include.
      
      Link: http://lkml.kernel.org/r/1510959741-31109-1-git-send-email-yang.s@alibaba-inc.com
      Signed-off-by: Yang Shi <yang.s@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/sched/mm.h: uninline mmdrop_async(), etc · d70f2a14
      Andrew Morton authored
      
      
      mmdrop_async() is only used in fork.c.  Move it and its support
      functions into fork.c and uninline it all.
      
      Quite a lot of code gets moved around to avoid forward declarations.
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slub: remove obsolete comments of put_cpu_partial() · 0d2d5d40
      Miles Chen authored
      Commit d6e0b7fa ("slub: make dead caches discard free slabs
      immediately") makes put_cpu_partial() run with preemption disabled and
      interrupts disabled when calling unfreeze_partials().
      
      The comment "put_cpu_partial() is done without interrupts disabled and
      without preemption disabled" is now obsolete, so remove it.
      
      Link: http://lkml.kernel.org/r/1516968550-1520-1-git-send-email-miles.chen@mediatek.com
      Signed-off-by: Miles Chen <miles.chen@mediatek.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slub.c: fix wrong address during slab padding restoration · 5d682681
      Balasubramani Vivekanandan authored
      
      
      The start address calculated for slab padding restoration was wrong.
      The wrong address would point to some section before the padding and
      could cause corruption.
      
      Link: http://lkml.kernel.org/r/1516604578-4577-1-git-send-email-balasubramani_vivekanandan@mentor.com
      Signed-off-by: Balasubramani Vivekanandan <balasubramani_vivekanandan@mentor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab.c: remove redundant assignments for slab_state · 84ebb582
      Oscar Salvador authored
      
      
      slab_state is being set to "UP" in create_kmalloc_caches(), and later on
      we set it again in kmem_cache_init_late(), but slab_state does not
      change in the meantime.
      
      Remove the redundant assignment from kmem_cache_init_late().
      
      And unless I overlooked anything, the same goes for "slab_state = FULL".
      slab_state is set to "FULL" in kmem_cache_init_late(), but it is later
      being set again in cpucache_init(), which gets called from
      do_initcall_level().  So remove the assignment from cpucache_init() as
      well.
      
      Link: http://lkml.kernel.org/r/20171215134452.GA1920@techadventures.net
      Signed-off-by: default avatarOscar Salvador <osalvador@techadventures.net>
      Acked-by: default avatarChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab_common.c: make calculate_alignment() static · 692ae74a
      Byongho Lee authored
      
      
      The calculate_alignment() function is only used inside slab_common.c,
      so make it static and let the compiler perform more optimizations.
      
      After this patch there's a small improvement in text and data size.
      
        $ gcc --version
          gcc (GCC) 7.2.1 20171128
      
      Before:
        text	   data	    bss	    dec	     hex	filename
        9890457  3828702  1212364 14931523 e3d643	vmlinux
      
      After:
        text	   data	    bss	    dec	     hex	filename
        9890437  3828670  1212364 14931471 e3d60f	vmlinux
      
      Also I fixed a style problem reported by checkpatch.
      
        WARNING: Missing a blank line after declarations
        #53: FILE: mm/slab_common.c:286:
        +		unsigned long ralign = cache_line_size();
        +		while (size <= ralign / 2)
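      For illustration, here is a userspace sketch of the alignment
      computation (constants such as the cache-line size and the minimum
      alignment are assumptions, not the kernel's per-arch values; the real
      function lives in mm/slab_common.c, and `static` is what lets the
      compiler optimize it more aggressively):

```c
#include <assert.h>

/* Hedged userspace sketch of the slab alignment computation. */
#define SLAB_HWCACHE_ALIGN 0x1UL
#define CACHE_LINE_SIZE    64UL   /* assumed */
#define ARCH_SLAB_MINALIGN 8UL    /* assumed */

#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* `static` keeps the symbol file-local, which is the point of the
 * patch: the compiler can then inline and specialize it freely. */
static unsigned long calculate_alignment(unsigned long flags,
                                         unsigned long align,
                                         unsigned long size)
{
    if (flags & SLAB_HWCACHE_ALIGN) {
        /* Shrink the alignment for small objects so several of them
         * still fit into one cache line. */
        unsigned long ralign = CACHE_LINE_SIZE;

        while (size <= ralign / 2)
            ralign /= 2;
        if (ralign > align)
            align = ralign;
    }
    if (align < ARCH_SLAB_MINALIGN)
        align = ARCH_SLAB_MINALIGN;
    return ALIGN_UP(align, sizeof(void *));
}
```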
      
      Link: http://lkml.kernel.org/r/20171210080132.406-1-bhlee.kernel@gmail.com
      Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: return error when we attempt to access a dirty bh in jbd2 · d984187e
      piaojun authored
      We should not reuse the dirty bh in jbd2 directly due to the following
      situation:
      
      1. When removing extent rec, we will dirty the bhs of extent rec and
         truncate log at the same time, and hand them over to jbd2.
      
      2. The bhs are submitted to jbd2 area successfully.
      
      3. The device's write-back thread helps flush the bhs to disk but
         encounters a write error due to an abnormal storage link.
      
      4. After a while the storage link becomes normal again.  The truncate
         log flush worker, triggered by the next space reclaim, finds the
         dirty bh of the truncate log, clears its 'BH_Write_EIO' flag and
         then sets it uptodate in __ocfs2_journal_access():
      
         ocfs2_truncate_log_worker
           ocfs2_flush_truncate_log
             __ocfs2_flush_truncate_log
               ocfs2_replay_truncate_records
                 ocfs2_journal_access_di
                __ocfs2_journal_access // here we clear io_error and set 'tl_bh' uptodate.
      
      5. Then jbd2 will flush the bh of truncate log to disk, but the bh of
         extent rec is still in error state, and unfortunately nobody will
         take care of it.
      
      6. At last the space of the extent rec was not reduced, but the
         truncate log flush worker has given it back to the global
         allocator.  That causes a duplicate-cluster problem, which can be
         identified by fsck.ocfs2.

      Sadly we can hardly revert this, so set the filesystem read-only to
      avoid ruining the atomicity and consistency of space reclaim.
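      The essence of the fix can be modeled in a few lines of hypothetical
      userspace C (the flag names only mimic the kernel's buffer-head
      bits): journal access must refuse a buffer that previously hit a
      write I/O error, instead of clearing the error and re-marking the
      buffer uptodate.

```c
#include <assert.h>
#include <errno.h>

/* Userspace model, not the ocfs2 code.  A write-back failure leaves the
 * buffer with BH_WRITE_EIO set and BH_UPTODATE cleared; the fix returns
 * -EIO to the caller so the filesystem can go read-only. */
#define BH_UPTODATE  (1u << 0)
#define BH_WRITE_EIO (1u << 1)

/* Before the fix the error was silently cleared here; after the fix the
 * error state is preserved and reported. */
static int journal_access_check(const unsigned int *bh_state)
{
    if ((*bh_state & BH_WRITE_EIO) && !(*bh_state & BH_UPTODATE))
        return -EIO;   /* do NOT clear the error and reuse the bh */
    return 0;
}
```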
      
      Link: http://lkml.kernel.org/r/5A6E8092.8090701@huawei.com
      Fixes: acf8fdbe
      
       ("ocfs2: do not BUG if buffer not uptodate in __ocfs2_journal_access")
      Signed-off-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: unlock bh_state if bg check fails · e75ed71b
      Changwei Ge authored
      
      
      We should unlock bh_state if bg->bg_free_bits_count > bg->bg_bits.
      
      Link: http://lkml.kernel.org/r/1516843095-23680-1-git-send-email-ge.changwei@h3c.com
      Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: nowait aio support · c4c2416a
      Gang He authored
      
      
      Return -EAGAIN for direct I/O if any of the following checks fail:

       - The related locks cannot be taken immediately.

       - Blocks are not allocated at the write location; the write would
         trigger block allocation and block I/O operations.
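      A minimal sketch of that decision, assuming a simplified request
      structure (hypothetical code, not the ocfs2 implementation):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* With IOCB_NOWAIT set, bail out with -EAGAIN whenever the request
 * could block on locks or on new block allocation. */
struct write_req {
    bool nowait;           /* IOCB_NOWAIT set by the caller */
    bool locks_available;  /* could all needed locks be trylocked? */
    bool blocks_allocated; /* blocks already allocated at the range? */
};

static int check_nowait_write(const struct write_req *req)
{
    if (!req->nowait)
        return 0;          /* blocking path: just wait as usual */
    if (!req->locks_available)
        return -EAGAIN;    /* would block on a cluster/inode lock */
    if (!req->blocks_allocated)
        return -EAGAIN;    /* would trigger allocation and block I/O */
    return 0;
}
```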
      
      [ghe@suse.com: v4]
        Link: http://lkml.kernel.org/r/1516007283-29932-4-git-send-email-ghe@suse.com
      [ghe@suse.com: v2]
        Link: http://lkml.kernel.org/r/1511944612-9629-4-git-send-email-ghe@suse.com
      Link: http://lkml.kernel.org/r/1511775987-841-4-git-send-email-ghe@suse.com
      Signed-off-by: Gang He <ghe@suse.com>
      Reviewed-by: Alex Chen <alex.chen@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: add ocfs2_overwrite_io() · ac604d3c
      Gang He authored
      
      
      Add the ocfs2_overwrite_io() function, which is used to judge whether
      a write only overwrites already-allocated blocks; otherwise, the
      write would incur extra block allocation overhead.
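      The judgment can be sketched against a toy extent map (field names
      are invented; the real code walks the ocfs2 extent tree): a write
      counts as a pure overwrite only if every block in its range lies
      inside a written extent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flat extent map, sorted by start block. */
struct extent {
    unsigned long start;   /* first logical block */
    unsigned long len;     /* number of blocks */
    bool written;          /* allocated AND marked written */
};

/* Return true iff [pos, pos+len) is fully covered by written extents,
 * i.e. the write would not need any allocation or unwritten-extent
 * conversion. */
static bool overwrite_io(const struct extent *map, size_t n,
                         unsigned long pos, unsigned long len)
{
    unsigned long end = pos + len;

    for (size_t i = 0; i < n && pos < end; i++) {
        const struct extent *e = &map[i];

        if (pos < e->start)
            return false;            /* hole before this extent */
        if (pos >= e->start + e->len)
            continue;                /* range starts past this extent */
        if (!e->written)
            return false;            /* unwritten: needs conversion */
        pos = e->start + e->len;     /* advance past this extent */
    }
    return pos >= end;
}
```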
      
      [ghe@suse.com: v3]
        Link: http://lkml.kernel.org/r/1514455665-16325-3-git-send-email-ghe@suse.com
      [ghe@suse.com: v2]
        Link: http://lkml.kernel.org/r/1511944612-9629-3-git-send-email-ghe@suse.com
      Link: http://lkml.kernel.org/r/1511775987-841-3-git-send-email-ghe@suse.com
      Signed-off-by: Gang He <ghe@suse.com>
      Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: alex chen <alex.chen@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: add ocfs2_try_rw_lock() and ocfs2_try_inode_lock() · 06e7f13d
      Gang He authored
      
      
      Patch series "ocfs2: add nowait aio support", v4.
      
      The VFS layer has introduced the non-blocking aio flag IOCB_NOWAIT,
      which tells the kernel to bail out if an AIO request would block, for
      reasons such as file allocations or triggering writeback, or if it
      would block while allocating requests during direct I/O.

      Subsequently, pwritev2/preadv2 can also leverage this part of the
      kernel code.  So far, ext4/xfs/btrfs support this feature.  Add the
      related code for the ocfs2 file system.
      
      This patch (of 3):
      
      Add ocfs2_try_rw_lock and ocfs2_try_inode_lock functions, which will be
      used in non-blocking IO scenarios.
      
      [ghe@suse.com: v2]
        Link: http://lkml.kernel.org/r/1511944612-9629-2-git-send-email-ghe@suse.com
      Link: http://lkml.kernel.org/r/1511775987-841-2-git-send-email-ghe@suse.com
      Signed-off-by: Gang He <ghe@suse.com>
      Reviewed-by: Jun Piao <piaojun@huawei.com>
      Acked-by: alex chen <alex.chen@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: add trimfs lock to avoid duplicated trims in cluster · 637dd20c
      Gang He authored
      
      
      ocfs2 supports trimming the underlying disk via the fstrim command.
      But there is a problem: ocfs2 is a shared-disk cluster file system,
      and if the user configures a scheduled fstrim job on each file system
      node, multiple nodes will trim the shared disk simultaneously, which
      is very wasteful in CPU and I/O consumption.  It might also
      negatively affect the lifetime of poor-quality SSD devices.

      So introduce a trimfs dlm lock for the nodes to communicate with each
      other in this case, so that only one fstrim command does the trimming
      on the shared disk across the cluster.  The fstrim commands from the
      other nodes should wait for the first fstrim to finish and then
      return success directly, to avoid running the same trim on the shared
      disk again.
      
      Link: http://lkml.kernel.org/r/1513228484-2084-2-git-send-email-ghe@suse.com
      Signed-off-by: Gang He <ghe@suse.com>
      Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: add trimfs dlm lock resource · 4882abeb
      Gang He authored
      
      
      Introduce a new dlm lock resource, which will be used to communicate
      during fstrimming of an ocfs2 device from cluster nodes.
      
      Link: http://lkml.kernel.org/r/1513228484-2084-1-git-send-email-ghe@suse.com
      Signed-off-by: Gang He <ghe@suse.com>
      Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: try to reuse extent block in dealloc without meta_alloc · 71a36944
      Changwei Ge authored
      
      
      A crash issue was reported by John Lightsey with a call trace as follows:
      
        ocfs2_split_extent+0x1ad3/0x1b40 [ocfs2]
        ocfs2_change_extent_flag+0x33a/0x470 [ocfs2]
        ocfs2_mark_extent_written+0x172/0x220 [ocfs2]
        ocfs2_dio_end_io+0x62d/0x910 [ocfs2]
        dio_complete+0x19a/0x1a0
        do_blockdev_direct_IO+0x19dd/0x1eb0
        __blockdev_direct_IO+0x43/0x50
        ocfs2_direct_IO+0x8f/0xa0 [ocfs2]
        generic_file_direct_write+0xb2/0x170
        __generic_file_write_iter+0xc3/0x1b0
        ocfs2_file_write_iter+0x4bb/0xca0 [ocfs2]
        __vfs_write+0xae/0xf0
        vfs_write+0xb8/0x1b0
        SyS_write+0x4f/0xb0
        system_call_fastpath+0x16/0x75
      
      The BUG told us that the extent tree wanted to grow but no metadata
      had been reserved ahead of time.  From my investigation into this
      issue, the root cause is that although not enough metadata was
      reserved, there should have been enough for the subsequent use: the
      rightmost extent was merged into its left neighbor after a number of
      rounds of marking extents written, because marking extents written
      produced many physically contiguous extents.  At last an empty
      extent showed up and the rightmost path was removed from the extent
      tree.

      Add a new mechanism to reuse extent blocks cached in dealloc, which
      were just unlinked from the extent tree, to solve this crash.

      The criterion is: while marking extents *written*, if extent rotation
      and merging result in unlinking extents, and the extent tree later
      grows without any metadata reserved ahead of time, try to reuse the
      extents cached in dealloc, where deleted extents are kept.
      
      Also, this patch addresses the issue John reported that ::dw_zero_count
      is not calculated properly.
      
      After applying this patch, the issue John reported was gone.  Thanks
      for the reproducer provided by John.  This patch has also passed the
      ocfs2-test suite (29 cases) run by New H3C Group.
      
      [ge.changwei@h3c.com: fix static checker warning]
        Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373F29196AE@H3CMLB12-EX.srv.huawei-3com.com
      [akpm@linux-foundation.org: brelse(NULL) is legal]
      Link: http://lkml.kernel.org/r/1515479070-32653-2-git-send-email-ge.changwei@h3c.com
      Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
      Reported-by: John Lightsey <john@nixnuts.net>
      Tested-by: John Lightsey <john@nixnuts.net>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: make metadata estimation accurate and clear · 63de8bd9
      Changwei Ge authored
      
      
      The current code assumes that ::w_unwritten_list always has only one
      item on it.  This is not right and is hard to understand, so improve
      how unwritten items are counted.
      
      Link: http://lkml.kernel.org/r/1515479070-32653-1-git-send-email-ge.changwei@h3c.com
      Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
      Reported-by: John Lightsey <john@nixnuts.net>
      Tested-by: John Lightsey <john@nixnuts.net>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/acl: use 'ip_xattr_sem' to protect getting extended attribute · 16c8d569
      piaojun authored
      
      
      The race between *set_acl and *get_acl can result in reading
      incomplete xattr data, as below:
      
        processA                                    processB
      
        ocfs2_set_acl
          ocfs2_xattr_set
            __ocfs2_xattr_set_handle
      
                                                    ocfs2_get_acl_nolock
                                                      ocfs2_xattr_get_nolock:
      
      processB may get incomplete xattr data if processA hasn't finished
      set_acl yet.
      
      So we should use 'ip_xattr_sem' to protect getting extended attribute in
      ocfs2_get_acl_nolock(), as other processes could be changing it
      concurrently.
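      A sequential model of the torn read (no real threads or semaphores;
      the structure and names are illustrative, not ocfs2's): a reader that
      honors the writer's lock can never observe a half-written value.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* `write_locked` models ip_xattr_sem held for write. */
struct xattr_store {
    bool write_locked;
    char value[8];
};

/* Writer: take the sem, update in two steps (a real writer touches
 * several buffers), then release. */
static void xattr_set_begin(struct xattr_store *s)
{
    s->write_locked = true;
    memcpy(s->value, "AB??", 5);       /* first half written */
}

static void xattr_set_finish(struct xattr_store *s)
{
    memcpy(s->value, "ABCD", 5);       /* second half written */
    s->write_locked = false;
}

/* Reader honoring the sem: refuses to read mid-update (in the kernel it
 * would sleep on down_read() instead).  Returns 0 on success. */
static int xattr_get_locked(const struct xattr_store *s, char *out)
{
    if (s->write_locked)
        return -1;                     /* would block until writer done */
    memcpy(out, s->value, 5);
    return 0;
}
```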
      
      Link: http://lkml.kernel.org/r/5A5DDCFF.7030001@huawei.com
      Signed-off-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Alex Chen <alex.chen@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: clean up dead code in alloc.c · d22aa615
      Changwei Ge authored
      
      
      Some stack variables are no longer used but still assigned.  Trim them.
      
      Link: http://lkml.kernel.org/r/1516105069-12643-1-git-send-email-ge.changwei@h3c.com
      Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
      Reviewed-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Alex Chen <alex.chen@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/xattr: assign errno to 'ret' in ocfs2_calc_xattr_init() · c0a1a6d7
      piaojun authored
      
      
      We need to catch the errno returned by ocfs2_xattr_get_nolock() and
      assign it to 'ret', for printing and for notifying the upper callers.
      
      Link: http://lkml.kernel.org/r/5A571CAF.8050709@huawei.com
      Signed-off-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Alex Chen <alex.chen@huawei.com>
      Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Acked-by: Gang He <ghe@suse.com>
      Acked-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE · ff26cc10
      Gang He authored
      If we can't get the inode lock immediately in
      ocfs2_inode_lock_with_page() when reading a page, we should not
      return directly, since this leads to a softlockup problem when the
      kernel is configured without CONFIG_PREEMPT.  The method is to take a
      blocking lock and immediately unlock it before returning; this avoids
      wasting CPU on lots of retries, improves fairness in acquiring the
      lock among multiple nodes, and increases efficiency when the same
      file is modified frequently from multiple nodes.
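      The retry-count difference can be sketched with a toy DLM model
      (hypothetical code with counters instead of real cluster locks): the
      old path spins on trylock while the lock stays contended, the new
      path pays one blocking acquire and one retry.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: the lock stays contended for a number of rounds unless a
 * blocking acquire waits the contention out and caches the lock. */
struct dlm_sim {
    int contended_rounds;  /* rounds another node still holds the lock */
    bool cached;           /* lock cached locally after blocking acquire */
};

static bool try_lock(struct dlm_sim *l)
{
    if (l->cached)
        return true;
    if (l->contended_rounds > 0) {
        l->contended_rounds--;   /* time passes, still contended */
        return false;
    }
    return true;
}

/* A blocking acquire waits out the contention and leaves the lock
 * level cached on this node; it is dropped right away. */
static void blocking_lock_then_unlock(struct dlm_sim *l)
{
    l->contended_rounds = 0;
    l->cached = true;
}

/* Old behavior: retry ->readpage() on every failed trylock. */
static int readpage_spin(struct dlm_sim *l)
{
    int attempts = 1;

    while (!try_lock(l))
        attempts++;
    return attempts;
}

/* New behavior: one failed trylock, one blocking acquire, one retry. */
static int readpage_blocking(struct dlm_sim *l)
{
    int attempts = 1;

    if (!try_lock(l)) {
        blocking_lock_then_unlock(l);
        attempts++;              /* the single retried ->readpage() */
    }
    return attempts;
}
```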
      
      The softlockup crash (when set /proc/sys/kernel/softlockup_panic to 1)
      looks like:
      
        Kernel panic - not syncing: softlockup: hung tasks
        CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        Call Trace:
          <IRQ>
          dump_stack+0x5c/0x82
          panic+0xd5/0x21e
          watchdog_timer_fn+0x208/0x210
          __hrtimer_run_queues+0xcc/0x200
          hrtimer_interrupt+0xa6/0x1f0
          smp_apic_timer_interrupt+0x34/0x50
          apic_timer_interrupt+0x96/0xa0
          </IRQ>
         RIP: 0010:unlock_page+0x17/0x30
         RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
         RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004
         RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300
         RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00
         R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518
         R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300
          ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
          ocfs2_readpage+0x41/0x2d0 [ocfs2]
          filemap_fault+0x12b/0x5c0
          ocfs2_fault+0x29/0xb0 [ocfs2]
          __do_fault+0x1a/0xa0
          __handle_mm_fault+0xbe8/0x1090
          handle_mm_fault+0xaa/0x1f0
          __do_page_fault+0x235/0x4b0
          trace_do_page_fault+0x3c/0x110
          async_page_fault+0x28/0x30
         RIP: 0033:0x7fa75ded638e
         RSP: 002b:00007ffd6657db18 EFLAGS: 00010287
         RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700
         RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700
         RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000
         R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770
         R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000
      
      As for the performance improvement, the testing time is reduced and
      CPU utilization decreases; the detailed data is as follows.  I ran
      the multi_mmap test case from the ocfs2-test package on a three-node
      cluster.
      
      Before applying this patch:
          PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         2754 ocfs2te+  20   0  170248   6980   4856 D 80.73 0.341   0:18.71 multi_mmap
         1505 root      rt   0  222236 123060  97224 S 2.658 6.015   0:01.44 corosync
            5 root      20   0       0      0      0 S 1.329 0.000   0:00.19 kworker/u8:0
           95 root      20   0       0      0      0 S 1.329 0.000   0:00.25 kworker/u8:1
         2728 root      20   0       0      0      0 S 0.997 0.000   0:00.24 jbd2/sda1-33
         2721 root      20   0       0      0      0 S 0.664 0.000   0:00.07 ocfs2dc-3C8CFD4
         2750 ocfs2te+  20   0  142976   4652   3532 S 0.664 0.227   0:00.28 mpirun
      
        ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
        Tests with "-b 4096 -C 32768"
        Thu Dec 28 14:44:52 CST 2017
        multi_mmap..................................................Passed.
        Runtime 783 seconds.
      
      After applying this patch:
      
          PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         2508 ocfs2te+  20   0  170248   6804   4680 R 54.00 0.333   0:55.37 multi_mmap
          155 root      20   0       0      0      0 S 2.667 0.000   0:01.20 kworker/u8:3
           95 root      20   0       0      0      0 S 2.000 0.000   0:01.58 kworker/u8:1
         2504 ocfs2te+  20   0  142976   4604   3480 R 1.667 0.225   0:01.65 mpirun
            5 root      20   0       0      0      0 S 1.000 0.000   0:01.36 kworker/u8:0
         2482 root      20   0       0      0      0 S 1.000 0.000   0:00.86 jbd2/sda1-33
          299 root       0 -20       0      0      0 S 0.333 0.000   0:00.13 kworker/2:1H
          335 root       0 -20       0      0      0 S 0.333 0.000   0:00.17 kworker/1:1H
          535 root      20   0   12140   7268   1456 S 0.333 0.355   0:00.34 haveged
         1282 root      rt   0  222284 123108  97224 S 0.333 6.017   0:01.33 corosync
      
        ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
        Tests with "-b 4096 -C 32768"
        Thu Dec 28 15:04:12 CST 2017
        multi_mmap..................................................Passed.
        Runtime 487 seconds.
      
      Link: http://lkml.kernel.org/r/1514447305-30814-1-git-send-email-ghe@suse.com
      Fixes: 1cce4df0
      
       ("ocfs2: do not lock/unlock() inode DLM lock")
      Signed-off-by: Gang He <ghe@suse.com>
      Reviewed-by: Eric Ren <zren@suse.com>
      Acked-by: alex chen <alex.chen@huawei.com>
      Acked-by: piaojun <piaojun@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: return -EROFS to mount.ocfs2 if inode block is invalid · 025bcbde
      piaojun authored
      
      
      If metadata is corrupted, such as an 'invalid inode block', mount()
      will fail and the filesystem will be set read-only, as below:

        ocfs2_mount
          ocfs2_initialize_super
            ocfs2_init_global_system_inodes
              ocfs2_iget
                ocfs2_read_locked_inode
                  ocfs2_validate_inode_block
                    ocfs2_error
                      ocfs2_handle_error
                        ocfs2_set_ro_flag(osb, 0);  // set readonly

      In this situation we need to return -EROFS to 'mount.ocfs2', so that
      the user can fix the filesystem with fsck and then mount again.  In
      addition, 'mount.ocfs2' should be updated correspondingly, as it
      currently returns 1 for every errno.  I will post a patch for
      'mount.ocfs2' too.
      
      Link: http://lkml.kernel.org/r/5A4302FA.2010606@huawei.com
      Signed-off-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Alex Chen <alex.chen@huawei.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Reviewed-by: Changwei Ge <ge.changwei@h3c.com>
      Reviewed-by: Gang He <ghe@suse.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: clean dead code in suballoc.c · dd7b5f9d
      Changwei Ge authored
      
      
      Stack variable fe is no longer used, so trim it to save some CPU cycles
      and stack space.
      
      Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373F1F5A8DD@H3CMLB14-EX.srv.huawei-3com.com
      Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: use the OCFS2_XATTR_ROOT_SIZE macro in ocfs2_reflink_xattr_header() · 32ed0bd7
      alex chen authored
      
      
      Using the OCFS2_XATTR_ROOT_SIZE macro improves the readability of the
      code.
      
      Link: http://lkml.kernel.org/r/5A2E2488.70301@huawei.com
      Signed-off-by: Alex Chen <alex.chen@huawei.com>
      Reviewed-by: Jun Piao <piaojun@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/cluster: close a race that fence can't be triggered · fc2af28b
      Yang Zhang authored
      
      
      When some nodes of the cluster face a TCP connection fault, ocfs2
      picks a quorum to continue working, while the other nodes are fenced
      by resetting the host.

      In order to decide which node should be fenced, ocfs2 leverages
      o2quo_state::qs_holds.  If that variable drops to zero, an attempt is
      made to decide whether to fence the local node.  However, in the
      specific scenario where the local node is not disconnected from the
      others at the same time, the above method fails to reduce ::qs_holds
      to zero, because the o2net 90s idle timers corresponding to the
      different nodes fire one after another:
      
        node 2			node 3
        90s idle timer elapses
        clear ::qs_conn_bm
        set hold
      				40s is passed
      				90 idle timer elapses
      				clear ::qs_conn_bm
      				set hold
        still up timer elapses
        clear hold (NOT to zero )
        90s idle timer elapses AGAIN
      				still up timer elapses.
      				clear hold
      				still up timer elapses
      
      To solve this issue, a node that has already been evicted from
      ::qs_conn_bm must not set a hold again and again from the idle timer.
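      The guard can be modeled in a few lines (the bitmap and field names
      mirror the description above, not the real o2quo code): once a node
      has been cleared from the connected bitmap, later idle-timer firings
      must not take another hold for it.

```c
#include <assert.h>

/* Tiny model of ::qs_conn_bm and ::qs_holds. */
struct quorum_sim {
    unsigned long conn_bm;   /* one bit per connected node */
    int holds;
};

/* 90s idle timer for `node`: only a node still present in the
 * connected bitmap may be evicted and contribute a hold; repeated
 * firings for an already-evicted node are ignored (the fix). */
static void idle_timer_fire(struct quorum_sim *qs, int node)
{
    if (!(qs->conn_bm & (1ul << node)))
        return;                      /* already evicted: no new hold */
    qs->conn_bm &= ~(1ul << node);   /* evict the node ... */
    qs->holds++;                     /* ... and take one hold for it */
}

/* "Still up" timer: drop one hold; when holds reaches zero the fence
 * decision can finally run. */
static void still_up_fire(struct quorum_sim *qs)
{
    if (qs->holds > 0)
        qs->holds--;
}
```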
      
      Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373F1F3F93B@H3CMLB12-EX.srv.huawei-3com.com
      Signed-off-by: Yang Zhang <zhang.yangB@h3c.com>
      Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: give an obvious tip for mismatched cluster names · a52370b3
      Gang He authored
      
      
      Add an obvious error message for mismatched cluster names between the
      on-disk name and the current cluster's name.  We can hit this case
      during OCFS2 cluster migration.

      If we give the user an obvious tip for why they cannot mount the
      file system after migration, they can quickly fix the mismatch.

      Second, move printing the ocfs2_fill_super() errno to before
      ocfs2_dismount_volume(), since ocfs2_dismount_volume() prints its own
      messages.

      I looked through all the code of OCFS2 (including o2cb); there is no
      other place which returns this error.  In fact, the calling path
      ocfs2_fill_super -> ocfs2_mount_volume -> ocfs2_dlm_init ->
      dlm_new_lockspace is a very specific one.  We can use this errno to
      give the user a clearer tip, since this case is fairly common during
      cluster migration, and the customer can quickly find the cause of the
      failure if an error is printed.  Also, I think it is not possible to
      add this errno in the o2cb path during ocfs2_dlm_init(), since the
      o2cb code has been stable for a long time.

      We only print this error tip when the user uses the pcmk stack,
      since with the o2cb stack the user will not meet this error.
      
      [ghe@suse.com: v2]
        Link: http://lkml.kernel.org/r/1495419305-3780-1-git-send-email-ghe@suse.com
      Link: http://lkml.kernel.org/r/1495089336-19312-1-git-send-email-ghe@suse.com
      Signed-off-by: Gang He <ghe@suse.com>
      Reviewed-by: Mark Fasheh <mfasheh@versity.com>
      Acked-by: Joseph Qi <jiangqi903@gmail.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a52370b3
    • Changwei Ge's avatar
      ocfs2/cluster: neaten a member of o2net_msg_handler · cfdce25c
      Changwei Ge authored
      
      
      It's odd that o2net_msg_handler::nh_func_data is declared as type
      o2net_msg_handler_func*.  So neaten it.
      
      Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373F1F554DA@H3CMLB14-EX.srv.huawei-3com.com
      Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
      Reviewed-by: Joseph Qi <jiangqi903@gmail.com>
      Reviewed-by: Alex Chen <alex.chen@huawei.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cfdce25c
    • Changwei Ge's avatar
      fs/ocfs2/dlm/dlmmaster.c: clean up dead code · e37b963c
      Changwei Ge authored
      
      
      This code has been commented out for 12 years.  Remove it.
      
      Link: http://lkml.kernel.org/r/63ADC13FD55D6546B7DECE290D39E373CED7EF9E@H3CMLB14-EX.srv.huawei-3com.com
      Signed-off-by: Changwei Ge <ge.changwei@h3c.com>
      Cc: Mark Fasheh <mfasheh@versity.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <jiangqi903@gmail.com>
      Cc: Changwei Ge <ge.changwei@h3c.com>
      Cc: alex chen <alex.chen@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e37b963c
    • Sudip Mukherjee's avatar
      m32r: remove abort() · d91dad45
      Sudip Mukherjee authored
      Commit 7c2c11b2 ("arch: define weak abort()") introduced a weak
      abort() that is common to all architectures, so an arch-specific
      abort() with the same code as the weak one is no longer needed.
      Remove the abort() for m32r.
      
      Link: http://lkml.kernel.org/r/1516912339-5665-1-git-send-email-sudipm.mukherjee@gmail.com
      Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d91dad45