  1. Feb 07, 2014
      arch/x86/mm/numa.c: initialize numa_kernel_nodes in numa_clear_kernel_node_hotplug() · 017c217a
      Tang Chen authored
      
      
      The on-stack variable numa_kernel_nodes in numa_clear_kernel_node_hotplug()
      was not initialized.  So we need to initialize it.
      
      [akpm@linux-foundation.org: use NODE_MASK_NONE, per David]
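
      A minimal sketch of the kind of fix described (the body of the function is
      elided; only the initialization matters, NODE_MASK_NONE being the all-zero
      nodemask initializer):

          static void __init numa_clear_kernel_node_hotplug(void)
          {
                  /* Without an initializer, the on-stack nodemask starts
                   * out holding stack garbage. */
                  nodemask_t numa_kernel_nodes = NODE_MASK_NONE;
                  ...
          }
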
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Tested-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Reported-by: Dave Jones <davej@redhat.com>
      Reported-by: David Rientjes <rientjes@google.com>
      Tested-by: Dave Jones <davej@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm: __set_page_dirty_nobuffers() uses spin_lock_irqsave() instead of spin_lock_irq() · a85d9df1
      KOSAKI Motohiro authored
      
      
      During an aio stress test, we observed the following lockdep warning.  This
      means AIO+numa_balancing is currently deadlockable.

      The problem is that aio_migratepage disables interrupts, but
      __set_page_dirty_nobuffers unintentionally enables them again.

      Generally, all helper functions should use spin_lock_irqsave() instead of
      spin_lock_irq() because they don't know their callers at all.
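
      As a sketch of the pattern (using mapping->tree_lock, the lock that
      __set_page_dirty_nobuffers takes; the exact diff may differ):

          /* Unsafe in a helper: the unlock unconditionally re-enables
           * interrupts, breaking callers that run with them disabled. */
          spin_lock_irq(&mapping->tree_lock);
          ...
          spin_unlock_irq(&mapping->tree_lock);

          /* Safe: save and restore the caller's interrupt state. */
          unsigned long flags;

          spin_lock_irqsave(&mapping->tree_lock, flags);
          ...
          spin_unlock_irqrestore(&mapping->tree_lock, flags);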
      
         other info that might help us debug this:
          Possible unsafe locking scenario:
      
                CPU0
                ----
           lock(&(&ctx->completion_lock)->rlock);
           <Interrupt>
             lock(&(&ctx->completion_lock)->rlock);
      
          *** DEADLOCK ***
      
            dump_stack+0x19/0x1b
            print_usage_bug+0x1f7/0x208
            mark_lock+0x21d/0x2a0
            mark_held_locks+0xb9/0x140
            trace_hardirqs_on_caller+0x105/0x1d0
            trace_hardirqs_on+0xd/0x10
            _raw_spin_unlock_irq+0x2c/0x50
            __set_page_dirty_nobuffers+0x8c/0xf0
            migrate_page_copy+0x434/0x540
            aio_migratepage+0xb1/0x140
            move_to_new_page+0x7d/0x230
            migrate_pages+0x5e5/0x700
            migrate_misplaced_page+0xbc/0xf0
            do_numa_page+0x102/0x190
            handle_pte_fault+0x241/0x970
            handle_mm_fault+0x265/0x370
            __do_page_fault+0x172/0x5a0
            do_page_fault+0x1a/0x70
            page_fault+0x28/0x30
      
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      mm/swap: fix race on swap_info reuse between swapoff and swapon · f893ab41
      Weijie Yang authored
      
      
      swapoff clears swap_info's SWP_USED flag prematurely and frees its
      resources after that.  A concurrent swapon can reuse this swap_info
      while its previous resources have not been completely cleared.
      
      These late-freed resources are:
       - p->percpu_cluster
       - swap_cgroup_ctrl[type]
       - block_device setting
       - inode->i_flags &= ~S_SWAPFILE
      
      This patch clears the SWP_USED flag only after all of its resources have
      been freed, so that swapon can safely reuse this swap_info via
      alloc_swap_info().
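
      A simplified sketch of the resulting ordering in sys_swapoff (not the
      exact diff; the resources are the ones listed above):

          /* Free the late resources first ... */
          free_percpu(p->percpu_cluster);
          p->percpu_cluster = NULL;
          swap_cgroup_swapoff(p->type);
          /* ... plus the block_device and S_SWAPFILE teardown ... */

          /* ... and only then clear SWP_USED, so a concurrent swapon
           * cannot pick this swap_info until it is fully torn down. */
          spin_lock(&swap_lock);
          p->flags = 0;
          spin_unlock(&swap_lock);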
      
      [akpm@linux-foundation.org: tidy up code comment]
      Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      swap: add a simple detector for inappropriate swapin readahead · 579f8290
      Shaohua Li authored
      
      
      This is a patch to improve the swap readahead algorithm.  It's from Hugh,
      and I changed it slightly.
      
      Hugh's original changelog:
      
      swapin readahead does a blind readahead, whether or not the swapin is
      sequential.  This may be okay on a harddisk, because large reads have
      relatively small costs, and if the readahead pages are unneeded they can
      be reclaimed easily - though, what if their allocation forced reclaim of
      useful pages?  But on SSD devices large reads are more expensive than
      small ones: if the readahead pages are unneeded, reading them in causes
      significant overhead.
      
      This patch adds very simplistic random read detection.  Stealing the
      PageReadahead technique from Konstantin Khlebnikov's patch, avoiding the
      vma/anon_vma sophistications of Shaohua Li's patch, swapin_nr_pages()
      simply looks at readahead's current success rate, and narrows or widens
      its readahead window accordingly.  There is little science to its
      heuristic: it's about as stupid as can be whilst remaining effective.
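
      An illustrative reduction of that heuristic (not the upstream code, which
      tracks hits via the stolen PageReadahead bit): widen the window while
      readahead keeps hitting, collapse it when it misses, within the limit set
      by page_cluster.

          static unsigned long swapin_nr_pages(unsigned long hits)
          {
                  static unsigned long prev_win = 1;
                  unsigned long max_pages = 1 << page_cluster;
                  unsigned long pages;

                  /* Miss: fall back to single-page reads at once. */
                  pages = hits ? prev_win * 2 : 1;
                  pages = min(pages, max_pages);
                  prev_win = pages;
                  return pages;
          }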
      
      The table below shows elapsed times (in centiseconds) when running a
      single repetitive swapping load across a 1000MB mapping in 900MB ram
      with 1GB swap (the harddisk tests had taken painfully long when I
      used mem=500M, but SSD shows similar results for that).
      
      Vanilla is the 3.6-rc7 kernel on which I started; Shaohua denotes his
      Sep 3 patch in mmotm and linux-next; HughOld denotes my Oct 1 patch
      which Shaohua showed to be defective; HughNew this Nov 14 patch, with
      page_cluster as usual at default of 3 (8-page reads); HughPC4 this same
      patch with page_cluster 4 (16-page reads); HughPC0 with page_cluster 0
      (1-page reads: no readahead).
      
      HDD for swapping to harddisk, SSD for swapping to VertexII SSD.  Seq for
      sequential access to the mapping, cycling five times around; Rand for
      the same number of random touches.  Anon for a MAP_PRIVATE anon mapping;
      Shmem for a MAP_SHARED anon mapping, equivalent to tmpfs.
      
      One weakness of Shaohua's vma/anon_vma approach was that it did not
      optimize Shmem: seen below.  Konstantin's approach was perhaps mistuned,
      50% slower on Seq: did not compete and is not shown below.
      
      HDD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
      Seq Anon     73921   76210   75611   76904   78191  121542
      Seq Shmem    73601   73176   73855   72947   74543  118322
      Rand Anon   895392  831243  871569  845197  846496  841680
      Rand Shmem 1058375 1053486  827935  764955  764376  756489
      
      SSD        Vanilla Shaohua HughOld HughNew HughPC4 HughPC0
      Seq Anon     24634   24198   24673   25107   21614   70018
      Seq Shmem    24959   24932   25052   25703   22030   69678
      Rand Anon    43014   26146   28075   25989   26935   25901
      Rand Shmem   45349   45215   28249   24268   24138   24332
      
      These tests are, of course, two extremes of a very simple case: under
      heavier mixed loads I've not yet observed any consistent improvement or
      degradation, and wider testing would be welcome.
      
      Shaohua Li:
      
      Tests show that Vanilla is slightly better in the sequential workload than
      Hugh's patch.  I observed that with Hugh's patch the readahead size is
      sometimes shrunk too fast (from 8 to 1 immediately) in the sequential
      workload if there is no hit.  In such a case, continuing to do readahead
      is actually beneficial.
      
      I didn't prepare a sophisticated algorithm for the sequential workload
      because so far we can't guarantee that sequentially accessed pages are
      swapped out sequentially.  So I slightly changed Hugh's heuristic - don't
      shrink the readahead size too fast.
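
      In terms of the illustrative sketch above, the change amounts to decaying
      the window gradually instead of collapsing it on a single miss:

          /* Shrink by half per miss rather than dropping to 1 page at once. */
          pages = hits ? prev_win * 2 : max(prev_win / 2, 1UL);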
      
      Here is my test result (unit second, 3 runs average):
      	Vanilla		Hugh		New
      Seq	356		370		360
      Random	4525		2447		2444
      
      The attached graph is the swapin/swapout throughput I collected with
      'vmstat 2'.  The first part is a random workload (until around 1200 on
      the x-axis) and the second part is a sequential workload.  swapin and
      swapout throughput are almost identical in steady state in both
      workloads, which is the expected behavior; in Vanilla, by contrast,
      swapin is much higher than swapout, especially in the random workload
      (because of wrong readahead).
      
      Original patches by: Shaohua Li and Konstantin Khlebnikov.
      
      [fengguang.wu@intel.com: swapin_nr_pages() can be static]
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Shaohua Li <shli@fusionio.com>
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ocfs2: free allocated clusters if error occurs after ocfs2_claim_clusters · fb951eb5
      Zongxun Wang authored
      
      
      Even when using the same jbd2 handle, we cannot roll back a transaction.
      So once an error occurs after clusters have been allocated successfully,
      the allocated clusters will never be used, which means they are lost.
      For example, ocfs2_claim_clusters may succeed when expanding a file, but
      ocfs2_insert_extent may then fail.  So we need to free the allocated
      clusters if they are not actually used.
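
      The shape of the fix, sketched with argument lists elided
      (ocfs2_free_clusters is the existing allocator-side release helper):

          ret = ocfs2_claim_clusters(...);
          if (ret < 0)
                  goto out;

          ret = ocfs2_insert_extent(...);
          if (ret < 0) {
                  /* The jbd2 transaction cannot be rolled back, so hand
                   * the just-claimed clusters back explicitly. */
                  ocfs2_free_clusters(...);
                  goto out;
          }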
      
      Signed-off-by: Zongxun Wang <wangzongxun@huawei.com>
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Acked-by: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Documentation/kernel-parameters.txt: fix memmap= language · 277cba1d
      Randy Dunlap authored
      
      
      Clean up descriptions of memmap= boot options.
      
      Add periods (full stops), drop commas, change "used" to "reserved" or
      "marked".
      
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andiry Xu <andiry.xu@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. Feb 05, 2014
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs · 878a876b
      Linus Torvalds authored
      Pull btrfs fixes from Chris Mason:
       "Filipe is fixing compile and boot problems with our crc32c rework, and
        Josef has disabled snapshot aware defrag for now.
      
        As the number of snapshots increases, we're hitting OOM.  For the
        short term we're disabling things until a bigger fix is ready"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
        Btrfs: use late_initcall instead of module_init
        Btrfs: use btrfs_crc32c everywhere instead of libcrc32c
        Btrfs: disable snapshot aware defrag for now
      Merge tag 'nfs-for-3.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs · d7512f79
      Linus Torvalds authored
      Pull NFS client bugfixes from Trond Myklebust:
       "Highlights:
      
         - Fix NFSv3 acl regressions
         - Fix NFSv4 memory corruption due to slot table abuse in
           nfs4_proc_open_confirm
         - nfs4_destroy_session must call rpc_destroy_waitqueue"
      
      * tag 'nfs-for-3.14-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
        fs: get_acl() must be allowed to return EOPNOTSUPP
        NFSv3: Fix return value of nfs3_proc_setacls
        NFSv3: Remove unused function nfs3_proc_set_default_acl
        NFSv4.1: nfs4_destroy_session must call rpc_destroy_waitqueue
        NFSv4: Fix memory corruption in nfs4_proc_open_confirm
        nfs: fix setting of ACLs on file creation.
      kbuild: don't enable DEBUG_INFO when building for COMPILE_TEST · 12b13835
      Linus Torvalds authored
      
      
      It really isn't very interesting to have DEBUG_INFO when doing compile
      coverage stuff (you wouldn't want to run the result anyway, that's kind
      of the whole point of COMPILE_TEST), and it currently makes the build
      take longer and use much more disk space for "all{yes,mod}config".
      
      There's somewhat active discussion about this still, and we might end up
      with some new config option for things like this (Andi points out that
      the silly X86_DECODER_SELFTEST option also slows down the normal
      coverage tests hugely), but I'm starting the ball rolling with this
      simple one-liner.
      
      DEBUG_INFO isn't that noticeable if you have tons of memory and a good
      IO subsystem, but it hurts you a lot if you don't - for very little
      upside for the common use.
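
      The one-liner is a Kconfig dependency change of roughly this shape
      (sketched from the changelog, not quoted from the diff):

          config DEBUG_INFO
                  bool "Compile the kernel with debug info"
                  depends on DEBUG_KERNEL && !COMPILE_TEST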
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>