  1. Sep 12, 2013
    • mm: page_alloc: fair zone allocator policy · 81c0a2bb
      Johannes Weiner authored
      
      
      Each zone that holds userspace pages of one workload must be aged at a
      speed proportional to the zone size.  Otherwise, the time an individual
      page gets to stay in memory depends on the zone it happened to be
      allocated in.  Asymmetry in the zone aging creates rather unpredictable
      aging behavior and results in the wrong pages being reclaimed, activated
      etc.
      
      But exactly this happens right now because of the way the page allocator
      and kswapd interact.  The page allocator uses per-node lists of all zones
      in the system, ordered by preference, when allocating a new page.  When
      the first iteration does not yield any results, kswapd is woken up and the
      allocator retries.  Due to the way kswapd reclaims zones below the high
      watermark while a zone can be allocated from when it is above the low
      watermark, the allocator may keep kswapd running while kswapd reclaim
      ensures that the page allocator can keep allocating from the first zone in
      the zonelist for extended periods of time.  Meanwhile the other zones
      rarely see new allocations and thus get aged much slower in comparison.
      
      The result is that the occasional page placed in lower zones gets
      relatively more time in memory, even gets promoted to the active list
      after its peers have long been evicted.  Meanwhile, the bulk of the
      working set may be thrashing on the preferred zone even though there may
      be significant amounts of memory available in the lower zones.
      
      Even the most basic test -- repeatedly reading a file slightly bigger than
      memory -- shows how broken the zone aging is.  In this scenario, no single
      page should be able to stay in memory long enough to get referenced twice and
      activated, but activation happens in spades:
      
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 0
            nr_inactive_file 0
            nr_active_file 8
            nr_inactive_file 1582
            nr_active_file 11994
        $ cat data data data data >/dev/null
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 70
            nr_inactive_file 258753
            nr_active_file 443214
            nr_inactive_file 149793
            nr_active_file 12021
      
      Fix this with a very simple round robin allocator.  Each zone is allowed a
      batch of allocations that is proportional to the zone's size, after which
      it is treated as full.  The batch counters are reset when all zones have
      been tried and the allocator enters the slowpath and kicks off kswapd
      reclaim.  Allocation and reclaim are now fairly spread out to all
      available/allowable zones:
      
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 0
            nr_inactive_file 174
            nr_active_file 4865
            nr_inactive_file 53
            nr_active_file 860
        $ cat data data data data >/dev/null
        $ grep active_file /proc/zoneinfo
            nr_inactive_file 0
            nr_active_file 0
            nr_inactive_file 666622
            nr_active_file 4988
            nr_inactive_file 190969
            nr_active_file 937
      
      When zone_reclaim_mode is enabled, allocations will now spread out to all
      zones on the local node, not just the first preferred zone (which on a 4G
      node might be a tiny Normal zone).
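
The round-robin policy described above can be sketched in plain C. This is a simplified userspace model, not the kernel implementation: `struct toy_zone`, `reset_batches()`, `alloc_page_fair()` and the `denominator` parameter are hypothetical names chosen for illustration, and the batch size here is simply a fixed fraction of the zone size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified zone; not the kernel's struct zone. */
struct toy_zone {
    long managed_pages; /* zone size in pages */
    long batch_left;    /* allocations still allowed in this round */
};

/* Give every zone a batch proportional to its size. */
static void reset_batches(struct toy_zone *zones, size_t nr, long denominator)
{
    assert(denominator > 0);
    for (size_t i = 0; i < nr; i++)
        zones[i].batch_left = zones[i].managed_pages / denominator;
}

/*
 * Return the index of the zone that serves the next page.  Zones are
 * tried in preference order; a zone whose batch is used up is treated
 * as full.  When every batch is exhausted the batches are reset, which
 * stands in for the point where the real allocator enters the slowpath
 * and wakes kswapd.  Returns nr if no zone can take an allocation.
 */
static size_t alloc_page_fair(struct toy_zone *zones, size_t nr, long denominator)
{
    for (int pass = 0; pass < 2; pass++) {
        for (size_t i = 0; i < nr; i++) {
            if (zones[i].batch_left > 0) {
                zones[i].batch_left--;
                return i;
            }
        }
        reset_batches(zones, nr, denominator);
    }
    return nr;
}
```

With two zones of 300 and 100 pages and a denominator of 100, eight allocations land 6:2, matching the 3:1 size ratio, instead of all eight hitting the first zone.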
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul Bolle <paul.bollee@gmail.com>
      Cc: Zlatko Calusic <zcalusic@bitsync.net>
      Tested-by: Kevin Hilman <khilman@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: page_alloc: rearrange watermark checking in get_page_from_freelist · e085dbc5
      Johannes Weiner authored
      
      
      Allocations that do not have to respect the watermarks are rare
      high-priority events.  Reorder the code such that per-zone dirty limits
      and future checks important only to regular page allocations are ignored
      in these extraordinary situations.
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul Bolle <paul.bollee@gmail.com>
      Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: fix numa reclaim balance problem in kswapd · 892f795d
      Johannes Weiner authored
      
      
      The way the page allocator interacts with kswapd creates aging imbalances,
      where the amount of time a userspace page gets in memory under reclaim
      pressure is dependent on which zone, which node the allocator took the
      page frame from.
      
      #1 fixes missed kswapd wakeups on NUMA systems, which lead to some
         nodes falling behind for a full reclaim cycle relative to the other
         nodes in the system
      
      #3 fixes an interaction where kswapd and a continuous stream of page
         allocations keep the preferred zone of a task between the high and
         low watermark (allocations succeed + kswapd does not go to sleep)
         indefinitely, completely underutilizing the lower zones and
         thrashing on the preferred zone
      
      These patches are the aging fairness part of the thrash-detection based
      file LRU balancing.  Andrea recommended to submit them separately as they
      are bugfixes in their own right.
      
      The following test ran a foreground workload (memcachetest) with
      background IO of various sizes on a 4 node 8G system (similar results were
      observed with single-node 4G systems):
      
      parallelio
                                                    BASE                   FAIRALLOC
      Ops memcachetest-0M              5170.00 (  0.00%)           5283.00 (  2.19%)
      Ops memcachetest-791M            4740.00 (  0.00%)           5293.00 ( 11.67%)
      Ops memcachetest-2639M           2551.00 (  0.00%)           4950.00 ( 94.04%)
      Ops memcachetest-4487M           2606.00 (  0.00%)           3922.00 ( 50.50%)
      Ops io-duration-0M                  0.00 (  0.00%)              0.00 (  0.00%)
      Ops io-duration-791M               55.00 (  0.00%)             18.00 ( 67.27%)
      Ops io-duration-2639M             235.00 (  0.00%)            103.00 ( 56.17%)
      Ops io-duration-4487M             278.00 (  0.00%)            173.00 ( 37.77%)
      Ops swaptotal-0M                    0.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-791M             245184.00 (  0.00%)              0.00 (  0.00%)
      Ops swaptotal-2639M            468069.00 (  0.00%)         108778.00 ( 76.76%)
      Ops swaptotal-4487M            452529.00 (  0.00%)          76623.00 ( 83.07%)
      Ops swapin-0M                       0.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-791M                108297.00 (  0.00%)              0.00 (  0.00%)
      Ops swapin-2639M               169537.00 (  0.00%)          50031.00 ( 70.49%)
      Ops swapin-4487M               167435.00 (  0.00%)          34178.00 ( 79.59%)
      Ops minorfaults-0M            1518666.00 (  0.00%)        1503993.00 (  0.97%)
      Ops minorfaults-791M          1676963.00 (  0.00%)        1520115.00 (  9.35%)
      Ops minorfaults-2639M         1606035.00 (  0.00%)        1799717.00 (-12.06%)
      Ops minorfaults-4487M         1612118.00 (  0.00%)        1583825.00 (  1.76%)
      Ops majorfaults-0M                  6.00 (  0.00%)              0.00 (  0.00%)
      Ops majorfaults-791M            13836.00 (  0.00%)             10.00 ( 99.93%)
      Ops majorfaults-2639M           22307.00 (  0.00%)           6490.00 ( 70.91%)
      Ops majorfaults-4487M           21631.00 (  0.00%)           4380.00 ( 79.75%)
      
                      BASE   FAIRALLOC
      User          287.78      460.97
      System       2151.67     3142.51
      Elapsed      9737.00     8879.34
      
                                        BASE   FAIRALLOC
      Minor Faults                  53721925    57188551
      Major Faults                    392195       15157
      Swap Ins                       2994854      112770
      Swap Outs                      4907092      134982
      Direct pages scanned                 0       41824
      Kswapd pages scanned          32975063     8128269
      Kswapd pages reclaimed         6323069     7093495
      Direct pages reclaimed               0       41824
      Kswapd efficiency                  19%         87%
      Kswapd velocity               3386.573     915.414
      Direct efficiency                 100%        100%
      Direct velocity                  0.000       4.710
      Percentage direct scans             0%          0%
      Zone normal velocity          2011.338     550.661
      Zone dma32 velocity           1365.623     369.221
      Zone dma velocity                9.612       0.242
      Page writes by reclaim    18732404.000  614807.000
      Page writes file              13825312      479825
      Page writes anon               4907092      134982
      Page reclaim immediate           85490        5647
      Sector Reads                  12080532      483244
      Sector Writes                 88740508    65438876
      Page rescued immediate               0           0
      Slabs scanned                    82560       12160
      Direct inode steals                  0           0
      Kswapd inode steals              24401       40013
      Kswapd skipped wait                  0           0
      THP fault alloc                      6           8
      THP collapse alloc                5481        5812
      THP splits                          75          22
      THP fault fallback                   0           0
      THP collapse fail                    0           0
      Compaction stalls                    0          54
      Compaction success                   0          45
      Compaction failures                  0           9
      Page migrate success            881492       82278
      Page migrate failure                 0           0
      Compaction pages isolated            0       60334
      Compaction migrate scanned           0       53505
      Compaction free scanned              0     1537605
      Compaction cost                    914          86
      NUMA PTE updates              46738231    41988419
      NUMA hint faults              31175564    24213387
      NUMA hint local faults        10427393     6411593
      NUMA pages migrated             881492       55344
      AutoNUMA cost                   156221      121361
      
      The overall runtime was reduced, throughput for both the foreground
      workload as well as the background IO improved, major faults, swapping and
      reclaim activity shrunk significantly, reclaim efficiency more than
      quadrupled.
      
      This patch:
      
      When the page allocator fails to get a page from all zones in its given
      zonelist, it wakes up the per-node kswapds for all zones that are at their
      low watermark.
      
      However, with a system under load the free pages in a zone can fluctuate
      enough that the allocation fails but the kswapd wakeup is also skipped
      while the zone is still really close to the low watermark.
      
      When one node misses a wakeup like this, it won't be aged before all the
      other nodes' zones are down to their low watermarks again.  And skipping a
      full aging cycle is an obvious fairness problem.
      
      Kswapd runs until the high watermarks are restored, so it should also be
      woken when the high watermarks are not met.  This ages nodes more equally
      and creates a safety margin for the page counter fluctuation.
      
      By using zone_balanced(), it will now check, in addition to the watermark,
      if compaction requires more order-0 pages to create a higher order page.
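
The difference between the two wakeup conditions can be illustrated with a small model. This is only a sketch under simplifying assumptions: `struct toy_zone_wm` and both helper functions are hypothetical, and the additional compaction check performed by zone_balanced() is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-zone state; only the fields needed for the comparison. */
struct toy_zone_wm {
    long free_pages;
    long low_wmark;
    long high_wmark;
};

/* Old behaviour as described: wake kswapd only for zones at their low watermark. */
static bool should_wake_old(const struct toy_zone_wm *z)
{
    assert(z->low_wmark <= z->high_wmark);
    return z->free_pages <= z->low_wmark;
}

/* New behaviour as described: wake kswapd whenever the high watermark is not met. */
static bool should_wake_new(const struct toy_zone_wm *z)
{
    assert(z->low_wmark <= z->high_wmark);
    return z->free_pages < z->high_wmark;
}
```

A zone whose free page count has bounced slightly above the low watermark is skipped by the old check but still woken by the new one, which is the safety margin the message refers to.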
      
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul Bolle <paul.bollee@gmail.com>
      Tested-by: Zlatko Calusic <zcalusic@bitsync.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/huge_memory.c: fix potential NULL pointer dereference · a8f531eb
      Libin authored
      
      
      In collapse_huge_page() there is a race window between releasing the
      mmap_sem read lock and taking the mmap_sem write lock, so find_vma() may
      return NULL.  So check the return value to avoid NULL pointer dereference.
      
      collapse_huge_page
      	khugepaged_alloc_page
      		up_read(&mm->mmap_sem)
      	down_write(&mm->mmap_sem)
      	vma = find_vma(mm, address)
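
The revalidation pattern behind this fix can be sketched in userspace. The lookup below is a toy stand-in that mimics find_vma()'s contract (first region ending above the address, or NULL); `struct toy_vma`, `toy_find_vma()` and `revalidate_range()` are hypothetical names, and no locking is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mapping descriptor covering [vm_start, vm_end). */
struct toy_vma {
    unsigned long vm_start;
    unsigned long vm_end;
};

/* Toy lookup: first region whose end lies above addr, or NULL if none. */
static struct toy_vma *toy_find_vma(struct toy_vma *vmas, size_t nr,
                                    unsigned long addr)
{
    assert(vmas != NULL || nr == 0);
    for (size_t i = 0; i < nr; i++)
        if (addr < vmas[i].vm_end)
            return &vmas[i];
    return NULL;
}

/*
 * After the lock was dropped and retaken, the mapping may be gone.
 * Return 0 if [addr, addr + len) is still inside one mapping, -1 otherwise.
 * The NULL check must come before any field of the result is read.
 */
static int revalidate_range(struct toy_vma *vmas, size_t nr,
                            unsigned long addr, unsigned long len)
{
    struct toy_vma *vma = toy_find_vma(vmas, nr, addr);

    if (!vma)
        return -1;
    if (addr < vma->vm_start || addr + len > vma->vm_end)
        return -1;
    return 0;
}
```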
      
      Signed-off-by: Libin <huawei.libin@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org> # v3.0+
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: kill one if loop in __free_pages_bootmem() · e2d0bd2b
      Yinghai Lu authored
      
      
      The loop body should not have to compare loop+1 against the loop end on
      every iteration.  Just duplicate two lines of code after the loop to
      avoid the check.
      
      That helps a bit when there is a huge number of pages to process, as on
      a system with 16TiB of memory.
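
A minimal sketch of this loop transformation, using a toy page structure rather than the kernel's. `struct toy_page` and `init_pages()` are hypothetical, and `__builtin_prefetch` (a GCC/Clang builtin) stands in for the kernel's prefetch helper.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-page state that gets reset. */
struct toy_page {
    int reserved;
    int count;
};

/*
 * Reset nr_pages pages while prefetching the next one.  The loop runs
 * over all pages but the last, so "p + 1" is always valid inside it and
 * no per-iteration bounds check is needed; the last page is handled by
 * the two duplicated lines after the loop.
 */
static void init_pages(struct toy_page *page, unsigned int nr_pages)
{
    struct toy_page *p = page;
    unsigned int loop;

    assert(nr_pages > 0);
    __builtin_prefetch(p, 1);
    for (loop = 0; loop < nr_pages - 1; loop++, p++) {
        __builtin_prefetch(p + 1, 1);
        p->reserved = 0;
        p->count = 0;
    }
    p->reserved = 0;
    p->count = 0;
}
```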
      
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: fix the value of fallback_migratetype in alloc_extfrag tracepoint() · f92310c1
      Srivatsa S. Bhat authored
      
      
      In the current code, the value of fallback_migratetype that is printed
      using the mm_page_alloc_extfrag tracepoint, is the value of the
      migratetype *after* it has been set to the preferred migratetype (if the
      ownership was changed).  Obviously that wouldn't have been the original
      intent.  (We already have a separate 'change_ownership' field to tell
      whether the ownership of the pageblock was changed from the
      fallback_migratetype to the preferred type.)
      
      The intent of the fallback_migratetype field is to show the migratetype
      from which we borrowed pages in order to satisfy the allocation request.
      So fix the code to print that value correctly.
      
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_allo.c: restructure free-page stealing code and fix a bug · fef903ef
      Srivatsa S. Bhat authored
      
      
      The free-page stealing code in __rmqueue_fallback() is somewhat hard to
      follow, and has an incredible amount of subtlety hidden inside!
      
      First off, there is a minor bug in the reporting of change-of-ownership of
      pageblocks.  Under some conditions, we try to move up to
      'pageblock_nr_pages' pages to the preferred allocation list.  But
      we change the ownership of that pageblock to the preferred type only if we
      manage to successfully move at least half of that pageblock (or if
      page_group_by_mobility_disabled is set).
      
      However, the current code ignores the latter part and sets the
      'migratetype' variable to the preferred type, irrespective of whether we
      actually changed the pageblock migratetype of that block or not.  So, the
      page_alloc_extfrag tracepoint can end up printing incorrect info (i.e.,
      'change_ownership' might be shown as 1 when it must have been 0).
      
      So fixing this involves moving the update of the 'migratetype' variable to
      the right place.  But looking closer, we observe that the 'migratetype'
      variable is used subsequently for checks such as "is_migrate_cma()".
      Obviously the intent there is to check if the *fallback* type is
      MIGRATE_CMA, but since we already set the 'migratetype' variable to
      start_migratetype, we end up checking if the *preferred* type is
      MIGRATE_CMA!!
      
      To make things more interesting, this actually doesn't cause a bug in
      practice, because we never change *anything* if the fallback type is CMA.
      
      So, restructure the code in such a way that it is trivial to understand
      what is going on, and also fix the above mentioned bug.  And while at it,
      also add a comment explaining the subtlety behind the migratetype used in
      the call to expand().
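
The ownership rule and the reporting fix can be modelled roughly as follows. This is a sketch of the logic as the message describes it, not the restructured kernel function: the enum, `struct toy_steal` and `toy_try_steal()` are hypothetical, and `pages_moved` is passed in rather than computed.

```c
#include <assert.h>

/* Hypothetical migratetype values for illustration only. */
enum toy_migratetype {
    TOY_MIGRATE_UNMOVABLE,
    TOY_MIGRATE_RECLAIMABLE,
    TOY_MIGRATE_MOVABLE,
    TOY_MIGRATE_CMA
};

/* What a tracepoint would report, plus the resulting pageblock type. */
struct toy_steal {
    int fallback_type;    /* the type pages were borrowed from */
    int change_ownership; /* 1 only if the pageblock really changed type */
    int block_type;       /* pageblock type after the steal attempt */
};

static struct toy_steal toy_try_steal(int start_type, int fallback_type,
                                      int pages_moved, int pageblock_nr_pages,
                                      int grouping_disabled)
{
    /* fallback_type is recorded up front and never overwritten. */
    struct toy_steal r = { fallback_type, 0, fallback_type };

    assert(pageblock_nr_pages > 0);

    /* Nothing is ever changed when borrowing from CMA. */
    if (fallback_type == TOY_MIGRATE_CMA)
        return r;

    /* Ownership changes only if at least half the block was moved. */
    if (pages_moved >= pageblock_nr_pages / 2 || grouping_disabled) {
        r.change_ownership = 1;
        r.block_type = start_type;
    }
    return r;
}
```

Keeping the reported fallback type separate from the block's resulting type is what prevents a later check such as "is the fallback type CMA?" from accidentally testing the preferred type.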
      
      [akpm@linux-foundation.org: remove unneeded `inline', small coding-style fix]
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: fix coding style and spelling · b8af2941
      Pintu Kumar authored
      
      
      Fix all errors reported by checkpatch and some small spelling mistakes.
      
      Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: make cluster allocation per-cpu · ebc2a1a6
      Shaohua Li authored
      
      
      Swap cluster allocation exists to get better request merging and thus
      better performance.  But the cluster is shared globally, so if multiple
      tasks are swapping, their disk accesses interleave.  Multiple tasks
      swapping at once is quite common: for example, each NUMA node has a
      kswapd thread doing swap, and multiple threads/processes may be doing
      direct page reclaim.
      
      The I/O scheduler can't help much here, because the tasks don't send
      their swapout IO down to the block layer at the same time.  The block
      layer does merge some IOs, but many are not merged, depending on how many
      tasks are doing swapout concurrently.  In practice, I've seen a lot of
      small IO in swapout workloads.
      
      Make the cluster allocation per-cpu here.  The interleaved disk access
      issue goes away.  All tasks swap out to their own cluster, so swapout
      becomes sequential and can easily be merged into large IO.  If one CPU
      can't get its per-cpu cluster (for example, because there is no free
      cluster left in the swap area), it falls back to scanning swap_map, so
      the CPU can still continue to swap.  We don't need to recycle free swap
      entries of other CPUs.
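
The per-cpu idea can be illustrated with a toy allocator. This is a sketch under strong simplifications, not the patch's data structures: `struct toy_swap_dev`, `toy_dev_init()` and `toy_alloc_slot()` are hypothetical, free clusters are simply handed out in ascending order, and slots are never freed.

```c
#include <assert.h>

#define CLUSTER_PAGES 256 /* pages per swap cluster */
#define TOY_NR_CPUS   2   /* hypothetical CPU count */
#define NO_CLUSTER    (-1L)

/* Per-cpu cursor: which cluster this CPU fills and the next slot in it. */
struct toy_percpu_cluster {
    long cluster;
    unsigned int next;
};

struct toy_swap_dev {
    long nr_clusters;
    long next_free_cluster; /* toy free-cluster source: ascending order */
    struct toy_percpu_cluster pcp[TOY_NR_CPUS];
};

static void toy_dev_init(struct toy_swap_dev *d, long nr_clusters)
{
    d->nr_clusters = nr_clusters;
    d->next_free_cluster = 0;
    for (int i = 0; i < TOY_NR_CPUS; i++) {
        d->pcp[i].cluster = NO_CLUSTER;
        d->pcp[i].next = 0;
    }
}

/*
 * Return a swap offset for this CPU, or -1 when no free cluster is
 * available and the caller would have to fall back to scanning swap_map.
 */
static long toy_alloc_slot(struct toy_swap_dev *d, int cpu)
{
    struct toy_percpu_cluster *pc;

    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    pc = &d->pcp[cpu];
    if (pc->cluster == NO_CLUSTER || pc->next == CLUSTER_PAGES) {
        if (d->next_free_cluster >= d->nr_clusters)
            return -1;
        pc->cluster = d->next_free_cluster++;
        pc->next = 0;
    }
    return pc->cluster * CLUSTER_PAGES + pc->next++;
}
```

Alternating requests from two CPUs yield offsets 0, 256, 1, 257, so each CPU's writes stay sequential within its own cluster instead of interleaving in one shared cluster.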
      
      In my test (swap to a 2-disk raid0 partition), this improves around 10%
      swapout throughput, and request size is increased significantly.
      
      How this impacts swap readahead is uncertain, though.  On one side,
      page reclaim always isolates and swaps several adjacent pages; this
      makes page reclaim write the pages sequentially and benefits readahead.
      On the other side, when several CPUs write pages in an interleaved
      fashion, the pages don't live _sequentially_ but relatively _near_ each
      other.  In the per-cpu allocation case, if adjacent pages are written by
      different CPUs, they will live relatively _far_ apart.  So how this
      impacts swap readahead depends on how many pages page reclaim isolates
      and swaps at a time.  If the number is big, this patch will benefit swap
      readahead.  Of course, this is about sequential access patterns.  The
      patch has no impact on random access patterns, because the new cluster
      allocation algorithm is just for SSD.
      
      An alternative solution is organizing the swap layout to be per-mm
      instead of this per-cpu approach.  In the per-mm layout, we allocate a
      disk range for each mm, so pages of one mm live adjacently on the swap
      disk.  The per-mm layout has potential lock contention issues if
      multiple reclaimers are swapping pages from one mm.  For a sequential
      workload, the per-mm layout is better for implementing swap readahead,
      because pages from the mm are adjacent on disk.  But the per-cpu layout
      isn't very bad in this workload, as page reclaim always isolates and
      swaps several pages at a time; such pages will still live on disk
      sequentially and readahead can utilize this.  For a random workload, the
      per-mm layout doesn't help request merging, because it's quite possible
      that pages from different mms are swapped out at the same time and IO
      can't be merged in the per-mm layout, while with the per-cpu layout we
      can merge requests from any mm.  Considering that random workloads are
      more common among workloads with swap (and the per-cpu approach isn't
      too bad for sequential workloads either), I'm choosing the per-cpu
      layout.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: fix races exposed by swap discard · edfe23da
      Shaohua Li authored
      
      
      The previous patch can expose races, according to Hugh:
      
      swapoff was sometimes failing with "Cannot allocate memory", coming from
      try_to_unuse()'s -ENOMEM: it needs to allow for swap_duplicate() failing
      on a free entry temporarily SWAP_MAP_BAD while being discarded.
      
      We should use ACCESS_ONCE() there, and whenever accessing swap_map
      locklessly; but rather than peppering it throughout try_to_unuse(), just
      declare *swap_map with volatile.
      
      try_to_unuse() is accustomed to *swap_map going down racily, but not
      necessarily to it jumping up from 0 to SWAP_MAP_BAD: we'll be safer to
      prevent that transition once SWP_WRITEOK is switched off, when it's a
      waste of time to issue discards anyway (swapon can do a whole discard).
      
      Another issue is:
      
      In swapin_readahead(), read_swap_cache_async() can read a bad swap entry,
      because we don't check whether the readahead swap entry is bad.  This
      doesn't break anything, but such a swapin page is wasteful and can only
      be freed at page reclaim.  We should avoid reading such swap entries.
      And in discard, we mark a swap entry SWAP_MAP_BAD and then switch it back
      to normal when the discard is finished.  If readahead reads such a swap
      entry, we have the same issue, so we must check whether the swap entry is
      bad too.
      
      Thanks to Hugh for the insight that swapin_readahead() could use a bad
      swap entry.
      
      [include Hugh's patch 'swap: fix swapoff ENOMEMs from discard']
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: make swap discard async · 815c2c54
      Shaohua Li authored
      
      
      swap can do cluster discard for SSD, which is good, but there are some
      problems here:
      
      1. swap does the discard just before page reclaim gets a swap entry and
         writes the disk sectors.  This is useless for high end SSDs, because an
         overwrite of a sector implies a discard of the original sector too.  A
         discard + overwrite == overwrite.
      
      2. the purpose of doing discard is to improve SSD firmware garbage
         collection.  Ideally we should send discard as early as possible, so
         firmware can do something smart.  Sending discard just after swap entry
         is freed is considered early compared to sending discard before write.
         Of course, if workload is already bound to gc speed, sending discard
         earlier or later doesn't make a difference.
      
      3. block discard is a sync API, which will delay scan_swap_map()
         significantly.
      
      4. Write and discard commands can be executed in parallel on PCIe SSDs.
         Making swap discard async can make execution more efficient.
      
      This patch makes swap discard async and moves discard to where the swap
      entry is freed.  Discard and write have no dependence now, so the above
      issues can be avoided.  Ideally we should do discard for any freed
      sectors, but discard is very slow on some SSDs, so this patch still does
      discard for a whole cluster.
      
      My test does several rounds of 'mmap, write, unmap', which triggers a
      lot of swap discard.  On a fusionio card, with this patch, the test
      runtime is reduced to 18% of the time without it, so around 5.5x faster.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: change block allocation algorithm for SSD · 2a8f9449
      Shaohua Li authored
      
      
      I'm using a fast SSD to do swap.  scan_swap_map() sometimes uses up to
      20~30% CPU time (when cluster is hard to find, the CPU time can be up to
      80%), which becomes a bottleneck.  scan_swap_map() scans a byte array to
      search a 256 page cluster, which is very slow.
      
      Here I introduced a simple algorithm to search cluster.  Since we only
      care about 256 pages cluster, we can just use a counter to track if a
      cluster is free.  Every 256 pages use one int to store the counter.  If
      the counter of a cluster is 0, the cluster is free.  All free clusters
      will be added to a list, so searching cluster is very efficient.  With
      this, scan_swap_map() overhead disappears.
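
The counter-plus-free-list scheme described above can be sketched as follows. This is a toy model, not the patch's code: `struct toy_swap` and its helpers are hypothetical, the swap area is a fixed four clusters, and pages are assumed to be allocated only in clusters already taken off the free list. Freed clusters are appended at the list tail, as this message notes further down.

```c
#include <assert.h>

#define CLUSTER_PAGES   256 /* pages per cluster, as in the text */
#define TOY_NR_CLUSTERS 4   /* hypothetical tiny swap area */
#define NIL             (-1)

struct toy_swap {
    unsigned int used[TOY_NR_CLUSTERS]; /* one usage counter per cluster */
    int next[TOY_NR_CLUSTERS];          /* free-list links */
    int head, tail;                     /* list of completely free clusters */
};

/* All clusters start free and are linked in ascending order. */
static void toy_swap_init(struct toy_swap *s)
{
    for (int i = 0; i < TOY_NR_CLUSTERS; i++) {
        s->used[i] = 0;
        s->next[i] = (i + 1 < TOY_NR_CLUSTERS) ? i + 1 : NIL;
    }
    s->head = 0;
    s->tail = TOY_NR_CLUSTERS - 1;
}

/* O(1): pop a free cluster from the list head, or NIL if there is none. */
static int toy_take_free_cluster(struct toy_swap *s)
{
    int c = s->head;

    if (c == NIL)
        return NIL;
    s->head = s->next[c];
    if (s->head == NIL)
        s->tail = NIL;
    s->next[c] = NIL;
    return c;
}

/* Account one allocated page at the given swap offset. */
static void toy_page_allocated(struct toy_swap *s, unsigned long offset)
{
    unsigned long c = offset / CLUSTER_PAGES;

    assert(c < TOY_NR_CLUSTERS && s->used[c] < CLUSTER_PAGES);
    s->used[c]++;
}

/* Account one freed page; a cluster whose counter hits 0 rejoins the list tail. */
static void toy_page_freed(struct toy_swap *s, unsigned long offset)
{
    unsigned long c = offset / CLUSTER_PAGES;

    assert(c < TOY_NR_CLUSTERS && s->used[c] > 0);
    if (--s->used[c] == 0) {
        s->next[c] = NIL;
        if (s->tail == NIL)
            s->head = (int)c;
        else
            s->next[s->tail] = (int)c;
        s->tail = (int)c;
    }
}
```

Finding a free cluster is a constant-time list pop here, in contrast to scanning a byte array for 256 consecutive free entries.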
      
      This might help low end SD card swap too, because if the cluster is
      aligned, SD firmware can do flash erase more efficiently.
      
      We only enable the algorithm for SSD.  Hard disk swap isn't fast enough
      and has downside with the algorithm which might introduce regression (see
      below).
      
      The patch slightly changes which cluster is chosen.  It always adds free
      clusters to the list tail.  This can help wear leveling for low end SSDs
      too.  And if no cluster is found, scan_swap_map() will search from the
      end of the last free cluster, which is random.  For SSD, this isn't a
      problem at all.
      
      Another downside is that the cluster must be aligned to 256 pages, which
      reduces the chance of finding a cluster.  I would expect this isn't a big
      problem for SSD because there is no seek penalty.  (And this is the
      reason I only enable the algorithm for SSD.)
      
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Kyungmin Park <kmpark@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc.c: use '__paginginit' instead of '__init' · 15ca220e
      Chen Gang authored
      
      
      set_pageblock_order() may be called during memory hotplug, so it needs to
      use '__paginginit' instead of '__init'.
      
      The related warning:
      
        The function __meminit .free_area_init_node() references
        a function __init .set_pageblock_order().
        If .set_pageblock_order is only used by .free_area_init_node then
        annotate .set_pageblock_order with a matching annotation.
      
      Signed-off-by: Chen Gang <gang.chen@asianux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix negative left shift count when PAGE_SHIFT > 20 · a7e83318
      Jerry Zhou authored
      
      
      When PAGE_SHIFT > 20, the result of "20 - PAGE_SHIFT" is negative, and
      the previous calculation generates an unexpected result.  In addition,
      if PAGE_SIZE >= 1MB, the memory size of "numentries" is already an
      integral multiple of 1MB.
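
The guarded rounding can be sketched in userspace like this. It is an illustration of the idea rather than the kernel code: `round_up_to_mb()` is a hypothetical helper, and the page shift is passed as a runtime parameter instead of being the compile-time PAGE_SHIFT constant.

```c
#include <assert.h>

/*
 * Round a page count up so that it covers a whole number of megabytes.
 * The shift "20 - page_shift" is only evaluated when it is positive;
 * for pages of 1MB or more every page count is already a multiple of
 * 1MB, so the value is returned unchanged.
 */
static unsigned long round_up_to_mb(unsigned long numentries,
                                    unsigned int page_shift)
{
    assert(page_shift < 64);
    if (page_shift < 20) {
        unsigned long pages_per_mb = 1UL << (20 - page_shift);

        numentries = (numentries + pages_per_mb - 1) & ~(pages_per_mb - 1);
    }
    return numentries;
}
```

With 4KiB pages (shift 12) there are 256 pages per megabyte, so 257 pages round up to 512; with 2MiB pages (shift 21) the count passes through untouched.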
      
      Signed-off-by: Jerry Zhou <uulinux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: replace strict_strtoul() with kstrtoul() · 3dbb95f7
      Jingoo Han authored
      
      
      The use of strict_strtoul() is not preferred, because strict_strtoul() is
      obsolete.  Thus, kstrtoul() should be used.
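
      For callers, kstrtoul() takes the string, a base, and a result pointer,
      and returns 0 on success or a negative errno.  Since the kernel function
      cannot run in userspace, the following is only a stand-in that mimics
      that return convention with strtoul().  The name parse_ulong() is
      hypothetical, and the exact set of inputs the kernel accepts may differ.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Userspace stand-in for the "0 or negative errno" convention.
 * Rejects empty input, signs, leading whitespace and trailing garbage;
 * tolerates a single trailing newline.
 */
static int parse_ulong(const char *s, unsigned int base, unsigned long *res)
{
    char *end;
    unsigned long val;

    /* strtoul() alone would silently accept leading spaces and a sign. */
    if (!isalnum((unsigned char)s[0]))
        return -EINVAL;

    errno = 0;
    val = strtoul(s, &end, (int)base);
    if (errno == ERANGE)
        return -ERANGE;
    if (end == s)
        return -EINVAL;
    if (*end == '\n')
        end++;
    if (*end != '\0')
        return -EINVAL;

    *res = val;
    return 0;
}
```

      A caller checks the return value first and only then uses the result,
      which is the pattern the converted call sites follow.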
      
      Signed-off-by: Jingoo Han <jg1.han@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmstats: track TLB flush stats on UP too · 6df46865
      Dave Hansen authored
      
      
      The previous patch doing vmstats for TLB flushes ("mm: vmstats: tlb flush
      counters") effectively missed UP since arch/x86/mm/tlb.c is only compiled
      for SMP.
      
      UP systems do not do remote TLB flushes, so compile those counters out on
      UP.
      
      arch/x86/kernel/cpu/mtrr/generic.c calls __flush_tlb() directly.  This is
      probably an optimization since both the mtrr code and __flush_tlb() write
      cr4.  It would probably be safe to make that a flush_tlb_all() (and then
      get these statistics), but the mtrr code is ancient and I'm hesitant to
      touch it other than to just stick in the counters.
      
      [akpm@linux-foundation.org: tweak comments]
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmstats: tlb flush counters · 9824cf97
      Dave Hansen authored
      
      
      I was investigating some TLB flush scaling issues and realized that we do
      not have any good methods for figuring out how many TLB flushes we are
      doing.
      
      It would be nice to be able to do these in generic code, but the
      arch-independent calls don't explicitly specify whether we actually need
      to do remote flushes or not.  In the end, we really need to know if we
      actually _did_ global vs.  local invalidations, so that leaves us with few
      options other than to muck with the counters from arch-specific code.
      
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap.c: get swapper address_space by using macro · 822518dc
      Sunghan Suh authored
      
      
      There is a proper macro to get the corresponding swapper address space
      from a swap entry.  Instead of directly accessing "swapper_spaces" array,
      use the "swap_address_space" macro.
      
      Signed-off-by: Sunghan Suh <sunghan.suh@samsung.com>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: mmap_region: kill correct_wcount/inode, use allow_write_access() · e8686772
      Oleg Nesterov authored
      
      
      correct_wcount and inode in mmap_region() just complicate the code.  This
      boolean was needed previously, when deny_write_access() was called before
      vma_merge().  Now we can simply check VM_DENYWRITE and call
      allow_write_access() if it is set.

      allow_write_access() checks file != NULL, so this is safe even if it were
      possible to use VM_DENYWRITE && !file.  We just need to ensure that we
      use the same file that was deny_write_access()'ed, so the patch also
      moves "file = vma->vm_file" down after allow_write_access().
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Colin Cross <ccross@android.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: do_mmap_pgoff: cleanup the usage of file_inode() · 077bf22b
      Oleg Nesterov authored
      
      
      Simple cleanup.  Move "struct inode *inode" variable into "if (file)"
      block to simplify the code and avoid the unnecessary check.
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Colin Cross <ccross@android.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: shift VM_GROWS* check from mmap_region() to do_mmap_pgoff() · b2c56e4f
      Oleg Nesterov authored
      
      
      mmap() doesn't allow non-anonymous mappings with a VM_GROWS* bit set.
      In particular this means that mmap_region()->vma_merge(file, vm_flags)
      must always fail if "vm_flags & VM_GROWS" is set incorrectly.

      So it does not make sense to check VM_GROWS* after we have already
      allocated the new vma.  The only caller that can pass this flag,
      do_mmap_pgoff(), can do the check itself.

      And this looks a bit more correct: mmap_region() has already unmapped
      the old mapping at this stage, but if mmap() is going to fail, it should
      avoid do_munmap() if possible.
      
      Note: we check VM_GROWS at the end to ensure that do_mmap_pgoff() won't
      return EINVAL in the case when it currently returns another error code.
      
      Many thanks to Hugh who nacked the buggy v1.
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swapfile.c: convert to pr_foo() · 465c47fd
      Andrew Morton authored
      
      
      A few 80-col gymnastics were cleaned up as a result.
      
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • swap: warn when a swap area overflows the maximum size · d6bbbd29
      Raymond Jennings authored
      
      
      It is possible to swapon a swap area that is too big for the pte width
      to handle.
      
      Presently this failure happens silently.
      
      Instead, emit a diagnostic to warn the user.
      
      Testing results, root prompt commands and kernel log messages:
      
      # lvresize /dev/system/swap --size 16G
      # mkswap /dev/system/swap
      # swapon /dev/system/swap
      
      Jul  7 04:27:22 warfang kernel: Adding 16777212k swap
      on /dev/mapper/system-swap.  Priority:-1 extents:1 across:16777212k
      
      # lvresize /dev/system/swap --size 64G
      # mkswap /dev/system/swap
      # swapon /dev/system/swap
      
      Jul  7 04:27:22 warfang kernel: Truncating oversized swap area, only
      using 33554432k out of 67108860k
      Jul  7 04:27:22 warfang kernel: Adding 33554428k swap
      on /dev/mapper/system-swap.  Priority:-1 extents:1 across:33554428k
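
      The figures in the log above can be reproduced with a small userspace
      sketch, assuming 4KB pages.  The function and parameter names are
      hypothetical, and the page counts are inferred from the kB values in
      the log rather than taken from the patch.

```c
#include <stdio.h>

#define PAGE_SHIFT 12 /* assumed 4KB pages, consistent with the log above */

/* Convert a page count to kB, the unit the log messages print. */
static unsigned long long pages_to_kb(unsigned long long pages)
{
    return pages << (PAGE_SHIFT - 10);
}

/*
 * Clamp the page count from the swap header to the largest count the
 * swap entry encoding can address, warning when truncation happens.
 * Both parameter names are hypothetical.
 */
static unsigned long long clamp_swap_pages(unsigned long long last_page,
                                           unsigned long long maxpages)
{
    if (last_page > maxpages) {
        fprintf(stderr,
                "Truncating oversized swap area, only using %lluk out of %lluk\n",
                pages_to_kb(maxpages), pages_to_kb(last_page));
        return maxpages;
    }
    return last_page;
}
```

      Under these assumptions, the 64G case corresponds to 16777215 pages in
      the header clamped to 8388608 addressable pages, which print as
      67108860k and 33554432k.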
      
      [akpm@linux-foundation.org: fix warning]
      Signed-off-by: Raymond Jennings <shentino@gmail.com>
      Acked-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/madvise.c: fix coding-style errors · ec9bed9d
      Vladimir Cernov authored
      
      
      This fixes the following errors:
      	- ERROR: "(foo*)" should be "(foo *)"
      	- ERROR: "foo ** bar" should be "foo **bar"
      
      Signed-off-by: Vladimir Cernov <gg.kaspersky@gmail.com>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: mempolicy: turn vma_set_policy() into vma_dup_policy() · ef0855d3
      Oleg Nesterov authored
      
      
      Simple cleanup.  Every user of vma_set_policy() does the same work, which
      looks a bit annoying imho.  Add a new trivial helper which does
      mpol_dup() + vma_set_policy() to simplify the callers.
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/block/swim.c: remove unnecessary platform_set_drvdata() · c07303c0
      Jingoo Han authored
      
      
      The driver core clears the driver data to NULL after device_release or
      on probe failure.  Thus, there is no need to manually clear the device
      driver data to NULL.
      
      Signed-off-by: Jingoo Han <jg1.han@samsung.com>
      Cc: Jean Delvare <khali@linux-fr.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cciss: set max scatter gather entries to 32 on P600 · e7b18ede
      Mike Miller authored
      
      
      At one time we used to set the maximum number of scatter gather elements
      on all Smart Array controllers to 32.  At some point in time the
      firmware began to write the "appropriate" value for each controller into
      the config table.  The cciss driver would then read that and set
      h->maxsgentries.

              h->maxsgentries = readl(&(h->cfgtable->MaxSGElements));
      
      On the P600 that value is 544.  Under some workloads a significant
      performance reduction may result.  This patch forces the P600 to use
      only 32 scatter gather elements.  Other controllers are not affected.
      
      Signed-off-by: Mike Miller <mike.miller@hp.com>
      Signed-off-by: Dwight (Bud) Brown <bubrown@redhat.com>
      Signed-off-by: Tomas Henzl <thenzl@redhat.com>
      Acked-by: Stephen M. Cameron <steve.cameron@hp.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • drivers/block/mg_disk.c: make mg_times_out() static · c86db975
      Jingoo Han authored
      
      
      mg_times_out() is used only in this file.  Fix the following sparse
      warning:
      
        drivers/block/mg_disk.c:639:6: warning: symbol 'mg_times_out' was not declared. Should it be static?
      
      Signed-off-by: Jingoo Han <jg1.han@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • block: support embedded device command line partition · bab55417
      Cai Zhiyong authored
      
      
      Read the block device partition table from the kernel command line.  This
      is intended for embedded devices with a fixed block device (eMMC).  There
      is no MBR, which saves storage space.  The bootloader can easily access
      data on the block device by absolute address.  Users can easily change
      the partition layout.
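
      As a rough illustration of what parsing one partition definition
      involves, here is a userspace sketch.  It assumes each definition has
      the form <size>[@<offset>](part-name), with '-' as the size meaning
      "the remaining space" and k/m/g size suffixes.  That grammar is an
      assumption about the referenced documentation, and struct partdef,
      parse_size() and parse_partdef() are hypothetical names, not the
      kernel's parser.

```c
#include <stdlib.h>
#include <string.h>

struct partdef {
    unsigned long long size;   /* bytes; unused when fill_rest is set */
    unsigned long long offset; /* bytes; valid only if has_offset */
    int has_offset;
    int fill_rest;             /* size was given as '-' */
    char name[32];
};

/* Parse a number with an optional k/m/g suffix; advances *s past it. */
static unsigned long long parse_size(const char **s)
{
    char *end;
    unsigned long long v = strtoull(*s, &end, 0);

    switch (*end) {
    case 'G': case 'g':
        v <<= 10; /* fall through */
    case 'M': case 'm':
        v <<= 10; /* fall through */
    case 'K': case 'k':
        v <<= 10;
        end++;
        break;
    default:
        break;
    }
    *s = end;
    return v;
}

/* Parse one "<size>[@<offset>](part-name)" item.  Returns 0 on success. */
static int parse_partdef(const char *s, struct partdef *p)
{
    const char *close;
    size_t len;

    memset(p, 0, sizeof(*p));
    if (*s == '\0')
        return -1;

    if (*s == '-') {
        p->fill_rest = 1;
        s++;
    } else {
        p->size = parse_size(&s);
    }

    if (*s == '@') {
        s++;
        p->offset = parse_size(&s);
        p->has_offset = 1;
    }

    if (*s == '(') {
        close = strchr(++s, ')');
        if (!close)
            return -1;
        len = (size_t)(close - s);
        if (len >= sizeof(p->name))
            len = sizeof(p->name) - 1;
        memcpy(p->name, s, len); /* name stays NUL-terminated via memset */
        s = close + 1;
    }

    /* Anything left over is a syntax error. */
    return *s == '\0' ? 0 : -1;
}
```

      Under the same assumption, a full device definition would be a device
      name followed by a comma-separated list of such items, for example
      "mmcblk0:1G(data0),1G(data1),-".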
      
      This code is modeled on the MTD command line partition parser, source
      "drivers/mtd/cmdlinepart.c".  For details of the partition syntax, see
      "Documentation/block/cmdline-partition.txt".
      
      [akpm@linux-foundation.org: fix printk text]
      [yongjun_wei@trendmicro.com.cn: fix error return code in parse_parts()]
      Signed-off-by: Cai Zhiyong <caizhiyong@huawei.com>
      Cc: Karel Zak <kzak@redhat.com>
      Cc: "Wanglin (Albert)" <albert.wanglin@huawei.com>
      Cc: Marius Groeger <mag@sysgo.de>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Brian Norris <computersforpeace@gmail.com>
      Cc: Artem Bityutskiy <dedekind@infradead.org>
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • block/blk-sysfs.c: replace strict_strtoul() with kstrtoul() · ed751e68
      Jingoo Han authored
      
      
      The usage of strict_strtoul() is not preferred, because strict_strtoul()
      is obsolete.  Thus, kstrtoul() should be used.
      
      Signed-off-by: Jingoo Han <jg1.han@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • block: replace strict_strtoul() with kstrtoul() · bb8e0e84
      Jingoo Han authored
      
      
      The use of strict_strtoul() is not preferred, because strict_strtoul() is
      obsolete.  Thus, kstrtoul() should be used.
      
      Signed-off-by: Jingoo Han <jg1.han@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/sched.h: don't use task->pid/tgid in same_thread_group/has_group_leader_pid · e1403b8e
      Oleg Nesterov authored
      
      
      task_struct->pid/tgid should go away.
      
      1. Change same_thread_group() to use task->signal for comparison.
      
      2. Change has_group_leader_pid(task) to compare task_pid(task) with
         signal->leader_pid.
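
      The two changes above can be sketched with toy userspace stand-ins for
      the kernel structures.  Only the fields needed for the comparison are
      modelled, and the thread_pid field is a hypothetical placeholder for
      whatever task_pid() reads in the kernel.

```c
#include <stdbool.h>

/* Toy stand-ins for the kernel structures; only the fields used here. */
struct pid {
    int nr;
};

struct signal_struct {
    struct pid *leader_pid; /* pid of the thread group leader */
};

struct task_struct {
    struct pid *thread_pid;       /* hypothetical field behind task_pid() */
    struct signal_struct *signal; /* shared by all threads in one group */
};

static struct pid *task_pid(const struct task_struct *task)
{
    return task->thread_pid;
}

/* 1. Threads belong to the same group iff they share a signal_struct. */
static bool same_thread_group(const struct task_struct *p1,
                              const struct task_struct *p2)
{
    return p1->signal == p2->signal;
}

/* 2. Compare the task's own pid with the group leader's pid. */
static bool has_group_leader_pid(const struct task_struct *p)
{
    return task_pid(p) == p->signal->leader_pid;
}
```

      Neither helper reads a numeric pid or tgid, which is the point of the
      change.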
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Sergey Dyasly <dserrg@gmail.com>
      Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix the end cluster offset of FIEMAP · 28e8be31
      Jie Liu authored

      Calling the fiemap ioctl(2) with a given start offset and a desired
      mapping range should show extents if possible.  However, we compute the
      end offset of the mapping via 'mapping_end -= cpos' before iterating over
      the extent records, which causes problems if the given fiemap length is
      small relative to the cluster size, e.g.:
      
      Cluster size 4096:
      debugfs.ocfs2 1.6.3
              Block Size Bits: 12   Cluster Size Bits: 12
      
      The extended fiemap test utility from David:
      https://gist.github.com/anonymous/6172331
      
      
      
      # dd if=/dev/urandom of=/ocfs2/test_file bs=1M count=1000
      # ./fiemap /ocfs2/test_file 4096 10
      start: 4096, length: 10
      File /ocfs2/test_file has 0 extents:
      #	Logical          Physical         Length           Flags
      	^^^^^ <-- No extent is shown
      
      In this case, in ocfs2_fiemap(), cpos == mapping_end == 1.  Hence the
      loop that searches the extent records is not executed at all.

      This patch removes the 'mapping_end -= cpos' in question and loops
      until cpos is larger than mapping_end, as usual.
      
      # ./fiemap /ocfs2/test_file 4096 10
      start: 4096, length: 10
      File /ocfs2/test_file has 1 extents:
      #	Logical          Physical         Length           Flags
      0:	0000000000000000 0000000056a01000 0000000006a00000 0000
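
      The arithmetic behind the empty result can be checked with a simplified
      userspace model.  It assumes mapping_end starts as the number of
      clusters covering start + length and that each loop iteration advances
      by one cluster.  The function names are hypothetical.

```c
#include <stdint.h>

/* Clusters needed to cover `bytes`, rounding up (cluster size = 1 << bits). */
static uint32_t clusters_for_bytes(uint64_t bytes, unsigned int bits)
{
    return (uint32_t)((bytes + (UINT64_C(1) << bits) - 1) >> bits);
}

/*
 * Count how often a "while (cpos < mapping_end)" loop would run.
 * subtract_cpos selects the old behaviour (mapping_end -= cpos).
 * Simplified: every iteration advances cpos by one cluster.
 */
static uint32_t fiemap_iterations(uint64_t start, uint64_t len,
                                  unsigned int bits, int subtract_cpos)
{
    uint32_t cpos = (uint32_t)(start >> bits);
    uint32_t mapping_end = clusters_for_bytes(start + len, bits);
    uint32_t n = 0;

    if (subtract_cpos)
        mapping_end -= cpos; /* turns an end offset into a length */

    while (cpos < mapping_end) {
        cpos++;
        n++;
    }
    return n;
}
```

      For start 4096, length 10 and 12 cluster size bits, cpos is 1 and the
      end cluster is 2.  With the subtraction the bound drops to 1 and the
      loop never runs; without it the loop runs once.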
      
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Reported-by: David Weber <wb@munzinger.de>
      Tested-by: David Weber <wb@munzinger.de>
      Cc: Sunil Mushran <sunil.mushran@gmail.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: remove unused variable ip in dlmfs_get_root_inode() · a72e27d3
      Joseph Qi authored
      
      
      Variable ip in dlmfs_get_root_inode() is defined but not used.  So clean
      it up.
      
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Reviewed-by: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix a tiny race case when firing callbacks · 6f8648e8
      Joyce authored
      
      
      In o2hb_shutdown_slot() and o2hb_check_slot(), the event is a local
      variable, so it is only valid for the duration of the call.  So the
      following tiny race may happen in an environment with multiple mounted
      volumes:
      
      o2hb-vol1                         o2hb-vol2
      1) o2hb_shutdown_slot
      allocate local event1
      2) queue_node_event
      add event1 to global o2hb_node_events
                                        3) o2hb_shutdown_slot
                                        allocate local event2
                                        4) queue_node_event
                                        add event2 to global o2hb_node_events
                                        5) o2hb_run_event_list
                                        delete event1 from o2hb_node_events
      6) o2hb_run_event_list
      event1 empty, return
      7) o2hb_shutdown_slot
      event1 lifecycle ends
                                        8) o2hb_fire_callbacks
                                        event1 is already *invalid*
      
      This patch makes it wait on o2hb_callback_sem while another thread is
      firing callbacks.  For performance reasons, we only call
      o2hb_run_event_list when there is an event queued.
      
      Signed-off-by: Joyce <xuejiufei@huawei.com>
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: avoid possible NULL pointer dereference in o2net_accept_one() · 03dbe88a
      Joseph Qi authored
      
      
      Since o2nm_get_node_by_num() may return NULL, we add this check in
      o2net_accept_one() to avoid possible NULL pointer dereference.
      
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: adjust code style for o2net_handler_tree_lookup() · 9a239e4c
      Joseph Qi authored
      
      
      The code in o2net_handler_tree_lookup() is easy to break by mistake.  So
      adjust its style to improve readability.
      
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: free path in ocfs2_remove_inode_range() · 7aebff18
      Younger Liu authored
      
      
      In ocfs2_remove_inode_range(), there is a memory leak.  The variable path
      is allocated with ocfs2_new_path_from_et(), but it is never freed.
      
      Signed-off-by: Younger Liu <younger.liu@huawei.com>
      Reviewed-by: Jie Liu <jeff.liu@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix possible double free in ocfs2_reflink_xattr_rec · 6cae6d31
      Joseph Qi authored
      
      
      In ocfs2_reflink_xattr_rec(), meta_ac and data_ac are allocated by calling
      ocfs2_lock_reflink_xattr_rec_allocators().
      
      Once an error occurs when allocating *data_ac, it frees *meta_ac, which
      was allocated before.  Here it mistakenly sets meta_ac to NULL instead of
      *meta_ac.  Then ocfs2_reflink_xattr_rec() will try to free meta_ac again,
      which is already invalid.
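
      The difference between clearing the local pointer and clearing the
      caller's pointer can be shown with a small userspace sketch.  The
      struct, the function names and the fail_data switch are hypothetical
      and only model the error path described above.

```c
#include <stdlib.h>

struct alloc_context {
    int reserved;
};

/*
 * Allocate two contexts through out-parameters.  If the second
 * allocation fails, release the first and clear the CALLER's pointer.
 * fail_data is a hypothetical switch that simulates the error path.
 */
static int lock_allocators(int fail_data, struct alloc_context **meta_ac,
                           struct alloc_context **data_ac)
{
    *meta_ac = malloc(sizeof(**meta_ac));
    if (!*meta_ac)
        return -1;

    *data_ac = fail_data ? NULL : malloc(sizeof(**data_ac));
    if (!*data_ac) {
        free(*meta_ac);
        /* "meta_ac = NULL" would only clear this function's local copy. */
        *meta_ac = NULL;
        return -1;
    }
    return 0;
}

/* Caller-side cleanup; safe only because failed paths leave NULL behind. */
static void free_allocators(struct alloc_context *meta_ac,
                            struct alloc_context *data_ac)
{
    free(meta_ac);
    free(data_ac);
}
```

      With the buggy assignment, the caller would still hold the stale
      pointer after the failure and its cleanup would free it a second time.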
      
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Reviewed-by: Jie Liu <jeff.liu@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/dlm: force clean refmap when doing local cleanup · 69b2bd16
      Xue jiufei authored
      
      
      dlm_do_local_recovery_cleanup() should force clean the refmap if the owner
      of the lockres is UNKNOWN.  Otherwise the node may hang when unmounting
      filesystems.  Here's the situation:
      
      	Node1                                    Node2
      dlmlock()
        -> dlm_get_lock_resource()
      send DLM_MASTER_REQUEST_MSG to
      other nodes.
      
                                             trying to master this lockres,
                                             return MAYBE.
      
      selected as the master of lockresA,
      set mle->master to Node1,
      and do assert_master,
      send DLM_ASSERT_MASTER_MSG to Node2.
                                             Node 2 has interest on lockresA
                                             and return
                                             DLM_ASSERT_RESPONSE_MASTERY_REF
                                             then something happened and
                                             Node2 crashed.
      
      Receiving DLM_ASSERT_RESPONSE_MASTERY_REF, set Node2 into refmap, and keep
      sending DLM_ASSERT_MASTER_MSG to other nodes
      
      o2hb finds Node2 down and calls dlm_hb_node_down() -->
      dlm_do_local_recovery_cleanup().  The master of lockresA is still
      UNKNOWN, so dlm_free_dead_locks() is not called.

      The master of lockresA is then set to Node1, but Node2 still remains in
      the refmap.

      When Node1 unmounts, it finds that the refmap of lockresA is not empty
      and attempts to migrate it to Node2.  But Node2 is already down, so the
      umount hangs, trying to migrate lockresA again and again.
      
      Signed-off-by: joyce <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>