  1. Aug 21, 2015
    • vfs: Test for and handle paths that are unreachable from their mnt_root · 397d425d
      Eric W. Biederman authored
      
      
      In rare cases a directory can be renamed out from under a bind mount.
      In those cases without special handling it becomes possible to walk up
      the directory tree to the root dentry of the filesystem and down
      from the root dentry to every other file or directory on the filesystem.
      
      Like division by zero, ".." from an unconnected path can not be given
      a useful semantic as there is no predicting at which path component
      the code will realize it is unconnected.  We certainly can not match
      the current behavior as the current behavior is a security hole.
      
      Therefore when encountering .. when following an unconnected path
      return -ENOENT.
      
      - Add a function path_connected to verify path->dentry is reachable
        from path->mnt.mnt_root.  AKA to validate that rename did not do
        something nasty to the bind mount.
      
        To avoid races path_connected must be called after following a path
        component to its next path component.
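
      The check described above can be modelled in userspace roughly as
      follows.  This is only a sketch: the toy_* types and functions are
      hypothetical stand-ins, not the kernel's struct dentry, struct path
      or the actual path_connected().  It assumes the filesystem root is
      its own parent, and it ignores crossing into the parent mount.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel's dentry and path. */
struct toy_dentry {
	struct toy_dentry *parent;	/* the filesystem root points to itself */
};

struct toy_path {
	struct toy_dentry *mnt_root;	/* root dentry of the (bind) mount */
	struct toy_dentry *dentry;	/* current position of the walk */
};

/* Return true if path->dentry is mnt_root or one of its descendants. */
static bool toy_path_connected(const struct toy_path *path)
{
	const struct toy_dentry *d = path->dentry;

	assert(d != NULL);
	for (;;) {
		if (d == path->mnt_root)
			return true;
		if (d->parent == d)	/* hit the filesystem root first */
			return false;
		d = d->parent;
	}
}

/*
 * Follow "..": step to the parent first, then validate the result, so
 * that a rename racing with the walk is noticed after the step.
 */
static int toy_follow_dotdot(struct toy_path *path)
{
	if (path->dentry != path->mnt_root)
		path->dentry = path->dentry->parent;
	if (!toy_path_connected(path))
		return -ENOENT;
	return 0;
}
```

      In this model, renaming a directory out from under the bind mount
      makes the next ".." step fail with -ENOENT instead of walking on
      towards the filesystem root.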
      
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      397d425d
    • dcache: Reduce the scope of i_lock in d_splice_alias · a03e283b
      Eric W. Biederman authored
      
      
      i_lock is only needed until __d_find_any_alias calls dget on the alias
      dentry.  After that the reference to new ensures that dentry_kill and
      d_delete will not remove the inode from the dentry, and remove the
      dentry from the inode->d_entry list.
      
      The inode i_lock came to be held over the __d_move calls in
      d_splice_alias through the successive introduction of locks with
      increasingly smaller scope.  First it was the dcache_lock, then
      it was the dcache_inode_lock, and finally inode->i_lock.
      
      Furthermore inode->i_lock is not held over any other calls
      to d_move or __d_move so it can not provide any meaningful
      rename protection.
      
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      a03e283b
    • dcache: Handle escaped paths in prepend_path · cde93be4
      Eric W. Biederman authored
      
      
      A rename can result in a dentry that by walking up d_parent
      will never reach its mnt_root.  For lack of a better term
      I call this an escaped path.
      
      prepend_path is called by four different functions __d_path,
      d_absolute_path, d_path, and getcwd.
      
      __d_path only wants to see paths that are connected to the root it passes
      in.  So __d_path needs prepend_path to return an error.
      
      d_absolute_path similarly wants to see paths that are connected to
      some root.  Escaped paths are not connected to any mnt_root so
      d_absolute_path needs prepend_path to return an error greater
      than 1.  So escaped paths will be treated like paths on lazily
      unmounted mounts.
      
      getcwd needs to prepend "(unreachable)" so getcwd also needs
      prepend_path to return an error.
      
      d_path is the interesting hold out.  d_path just wants to print
      something, and does not care about the weird cases.  Which raises
      the question what should be printed?
      
      Given that <escaped_path>/<anything> should result in -ENOENT I
      believe it is desirable for escaped paths to be printed as empty
      paths.  As there are not really any meaningful path components when
      considered from the perspective of a mount tree.
      
      So tweak prepend_path to return an empty path with a new error
      code of 3 when it encounters an escaped path.
      
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      cde93be4
    • mm: fix potential data race in SyS_swapon · 6f179af8
      Hugh Dickins authored
      
      
      While running KernelThreadSanitizer (ktsan) on upstream kernel with
      trinity, we got a few reports from SyS_swapon, here is one of them:
      
      Read of size 8 by thread T307 (K7621):
       [<     inlined    >] SyS_swapon+0x3c0/0x1850 SYSC_swapon mm/swapfile.c:2395
       [<ffffffff812242c0>] SyS_swapon+0x3c0/0x1850 mm/swapfile.c:2345
       [<ffffffff81e97c8a>] ia32_do_call+0x1b/0x25
      
      Looks like the swap_lock should be taken when iterating through the
      swap_info array on lines 2392 - 2401: q->swap_file may be reset to
      NULL by another thread before it is dereferenced for f_mapping.
      
      But why is that iteration needed at all?  Doesn't the claim_swapfile()
      which follows do all that is needed to check for a duplicate entry -
      FMODE_EXCL on a bdev, testing IS_SWAPFILE under i_mutex on a regfile?
      
      Well, not quite: bd_may_claim() allows the same "holder" to claim the
      bdev again, so we do need to use a different holder than "sys_swapon";
      and we should not replace appropriate -EBUSY by inappropriate -EINVAL.
      
      Index i was reused in a cpu loop further down: renamed cpu there.
      
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      6f179af8
    • Merge branch 'superblock-scaling' of... · 061f98e9
      Al Viro authored
      Merge branch 'superblock-scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into for-next
      
      Conflicts:
      	include/linux/fs.h
      061f98e9
  2. Aug 19, 2015
    • Merge branch 'ufs' into for-next · b5f5914c
      Al Viro authored
      b5f5914c
    • Merge branch 'sb_writers_pcpu_rwsem' of... · 15cf3b7a
      Al Viro authored
      Merge branch 'sb_writers_pcpu_rwsem' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc into for-next
      15cf3b7a
    • inode: don't softlockup when evicting inodes · ac05fbb4
      Josef Bacik authored
      
      
      On a box with a lot of ram (148gb) I can make the box softlockup after running
      an fs_mark job that creates hundreds of millions of empty files.  This is
      because we never generate enough memory pressure to keep the number of inodes on
      our unused list low, so when we go to unmount we have to evict ~100 million
      inodes.  This makes one processor a very unhappy person, so add a cond_resched()
      in dispose_list() and if we need a resched when processing the s_inodes list do
      that and run dispose_list() on what we've currently culled.  Thanks,
      
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      ac05fbb4
  3. Aug 18, 2015
    • inode: rename i_wb_list to i_io_list · c7f54084
      Dave Chinner authored
      
      
      There's a small consistency problem between the inode and writeback
      naming. Writeback calls the "for IO" inode queues b_io and
      b_more_io, but the inode calls these the "writeback list" or
      i_wb_list. This makes it hard to add a new "under writeback" list to
      the inode, or call it an "under IO" list on the bdi because either
      way we'll have writeback on IO and IO on writeback and it'll just be
      confusing. I'm getting confused just writing this!
      
      So, rename the inode "for IO" list variable to i_io_list so we can
      add a new "writeback list" in a subsequent patch.
      
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Dave Chinner <dchinner@redhat.com>
      c7f54084
    • sync: serialise per-superblock sync operations · e97fedb9
      Dave Chinner authored
      
      
      When competing sync(2) calls walk the same filesystem, they need to
      walk the list of inodes on the superblock to find all the inodes
      that we need to wait for IO completion on. However, when multiple
      wait_sb_inodes() calls do this at the same time, they contend on the
      inode_sb_list_lock and the contention causes system wide
      slowdowns. In effect, concurrent sync(2) calls can take longer and
      burn more CPU than if they were serialised.
      
      Stop the worst of the contention by adding a per-sb mutex to wrap
      around wait_sb_inodes() so that we only execute one sync(2) IO
      completion walk per superblock at a time and hence avoid
      contention being triggered by concurrent sync(2) calls.
      
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Dave Chinner <dchinner@redhat.com>
      e97fedb9
    • inode: convert inode_sb_list_lock to per-sb · 74278da9
      Dave Chinner authored
      
      
      The process of reducing contention on per-superblock inode lists
      starts with moving the locking to match the per-superblock inode
      list. This takes the global lock out of the picture and reduces the
      contention problems to within a single filesystem. This doesn't get
      rid of contention as the locks still have global CPU scope, but it
      does isolate operations on different superblocks from each other.
      
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Dave Chinner <dchinner@redhat.com>
      74278da9
    • inode: add hlist_fake to avoid the inode hash lock in evict · cbedaac6
      Josef Bacik authored
      
      
      Some filesystems don't use the VFS inode hash and fake the fact they
      are hashed so that all the writeback code works correctly. However,
      this means the evict() path still tries to remove the inode from the
      hash, meaning that the inode_hash_lock() needs to be taken
      unnecessarily. Hence under certain workloads the inode_hash_lock can
      be contended even if the inode is never actually hashed.
      
      To avoid this add hlist_fake to test if the inode isn't actually
      hashed to avoid taking the hash lock on inodes that have never been
      hashed.  Based on Dave Chinner's
      
      inode: add IOP_NOTHASHED to avoid inode hash lock in evict
      
      based on Al's suggestions.  Thanks,
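
      For illustration, the "fake hashed" trick can be modelled in
      userspace like this.  The node layout and the hlist_* helpers follow
      the kernel's list.h as far as I know it; needs_hash_lock() is a
      hypothetical helper that only stands in for the decision the evict
      path has to make, and is not code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace copy of the hlist node layout. */
struct hlist_node {
	struct hlist_node *next;
	struct hlist_node **pprev;	/* address of the pointer that points at us */
};

static void init_hlist_node(struct hlist_node *n)
{
	n->next = NULL;
	n->pprev = NULL;
}

/* A node with no back pointer is on no list at all. */
static bool hlist_unhashed(const struct hlist_node *n)
{
	return !n->pprev;
}

/* Make the node look hashed by pointing pprev at its own next field. */
static void hlist_add_fake(struct hlist_node *n)
{
	n->pprev = &n->next;
}

/* A fake-hashed node is recognisable by that self reference. */
static bool hlist_fake(struct hlist_node *n)
{
	return n->pprev == &n->next;
}

/* Hypothetical helper: only really hashed nodes need the hash lock. */
static bool needs_hash_lock(struct hlist_node *n)
{
	assert(n != NULL);
	return !hlist_unhashed(n) && !hlist_fake(n);
}
```

      A node that went through hlist_add_fake() still reports itself as
      hashed, which keeps the writeback code happy, but the self
      reference lets the removal path skip the lock entirely.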
      
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Tested-by: Dave Chinner <dchinner@redhat.com>
      cbedaac6
    • writeback: plug writeback at a high level · d353d758
      Dave Chinner authored
      
      
      Doing writeback on lots of little files causes terrible IOPS storms
      because of the per-mapping writeback plugging we do. This
      essentially causes immediate dispatch of IO for each mapping,
      regardless of the context in which writeback is occurring.
      
      IOWs, running a concurrent write-lots-of-small 4k files using fsmark
      on XFS results in a huge number of IOPS being issued for data
      writes.  Metadata writes are sorted and plugged at a high level by
      XFS, so aggregate nicely into large IOs. However, data writeback IOs
      are dispatched in individual 4k IOs, even when the blocks of two
      consecutively written files are adjacent.
      
      Test VM: 8p, 8GB RAM, 4xSSD in RAID0, 100TB sparse XFS filesystem,
      metadata CRCs enabled.
      
      Kernel: 3.10-rc5 + xfsdev + my 3.11 xfs queue (~70 patches)
      
      Test:
      
      $ ./fs_mark  -D  10000  -S0  -n  10000  -s  4096  -L  120  -d
      /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d
      /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d
      /mnt/scratch/6  -d  /mnt/scratch/7
      
      Result:
      
      		wall	sys	create rate	Physical write IO
      		time	CPU	(avg files/s)	 IOPS	Bandwidth
      		-----	-----	------------	------	---------
      unpatched	6m56s	15m47s	24,000+/-500	26,000	130MB/s
      patched		5m06s	13m28s	32,800+/-600	 1,500	180MB/s
      improvement	-26.44%	-14.68%	  +36.67%	-94.23%	+38.46%
      
      If I use zero length files, this workload runs at about 500 IOPS, so
      plugging drops the data IOs from roughly 25,500/s to 1000/s.
      3 lines of code, 35% better throughput for 15% less CPU.
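
      The "improvement" row is plain relative change, (after - before) /
      before.  A small sketch that recomputes it from the two rows above,
      with the times converted to seconds (6m56s = 416s, 5m06s = 306s,
      15m47s = 947s, 13m28s = 808s); pct_change() and print_improvements()
      are names made up for this example:

```c
#include <assert.h>
#include <stdio.h>

/* Relative change from before to after, in percent. */
static double pct_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}

/* Recompute the improvement row from the unpatched/patched rows. */
static void print_improvements(void)
{
	printf("wall time   %+.2f%%\n", pct_change(416.0, 306.0));
	printf("sys CPU     %+.2f%%\n", pct_change(947.0, 808.0));
	printf("create rate %+.2f%%\n", pct_change(24000.0, 32800.0));
	printf("IOPS        %+.2f%%\n", pct_change(26000.0, 1500.0));
	printf("bandwidth   %+.2f%%\n", pct_change(130.0, 180.0));
}
```

      Subtracting the roughly 500 IOPS of the zero length run from both
      IOPS figures gives the 25,500/s and 1000/s quoted for data IOs.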
      
      The benefits of plugging at this layer are likely to be higher for
      spinning media as the IO patterns for this workload are going to make a
      much bigger difference on high IO latency devices.....
      
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Tested-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      d353d758
  4. Aug 17, 2015
    • Linux 4.2-rc7 · 2c6625cd
      Linus Torvalds authored
      v4.2-rc7
      2c6625cd
    • Merge tag 'armsoc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc · 8916e0b0
      Linus Torvalds authored
      Pull ARM SoC fixes from Olof Johansson:
       "A smallish batch of fixes, a little more than expected this late, but
        all fixes are contained to their platforms and seem reasonably low
        risk:
      
         - a somewhat large SMP fix for ux500 that still seemed warranted to
           include here
         - OMAP DT fixes for pbias regulator specification that broke due to
           some DT reshuffling
         - PCIe IRQ routing bugfix for i.MX
         - networking fixes for keystone
         - runtime PM for OMAP GPMC
         - a couple of error path bug fixes for exynos"
      
      * tag 'armsoc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc:
        ARM: dts: keystone: Fix the mdio bindings by moving it to soc specific file
        ARM: dts: keystone: fix the clock node for mdio
        memory: omap-gpmc: Don't try to save uninitialized GPMC context
        ARM: imx6: correct i.MX6 PCIe interrupt routing
        ARM: ux500: add an SMP enablement type and move cpu nodes
        ARM: dts: dra7: Fix broken pbias device creation
        ARM: dts: OMAP5: Fix broken pbias device creation
        ARM: dts: OMAP4: Fix broken pbias device creation
        ARM: dts: omap243x: Fix broken pbias device creation
        ARM: EXYNOS: fix double of_node_put() on error path
        ARM: EXYNOS: Fix potentian kfree() of ro memory
      8916e0b0
    • Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus · 0f405bf7
      Linus Torvalds authored
      Pull MIPS bugfix from Ralf Baechle:
       "Only a single MIPS fix - the math when invoking syscall_trace_enter
        was wrong"
      
      * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
        MIPS: Fix seccomp syscall argument for MIPS64
      0f405bf7
    • Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 01565479
      Linus Torvalds authored
      Merge x86 fixes from Ingo Molnar:
       "Two followup fixes related to the previous LDT fix"
      
      Also applied a further FPU emulation fix from Andy Lutomirski to the
      branch before actually merging it.
      
      * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
        x86/ldt: Further fix FPU emulation
        x86/ldt: Correct FPU emulation access to LDT
        x86/ldt: Correct LDT access in single stepping logic
      01565479
    • x86/ldt: Further fix FPU emulation · 12e244f4
      Andy Lutomirski authored
      
      
      The previous fix confused a selector with a segment prefix.  Fix it.
      
      Compile-tested only.
      
      Cc: stable@vger.kernel.org
      Cc: Juergen Gross <jgross@suse.com>
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Fixes: 4809146b ("x86/ldt: Correct FPU emulation access to LDT")
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      12e244f4
    • fs/fuse: fix ioctl type confusion · 8ed1f0e2
      Jann Horn authored
      
      
      fuse_dev_ioctl() performed fuse_get_dev() on a user-supplied fd,
      leading to a type confusion issue. Fix it by checking file->f_op.
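
      The kind of check meant here can be sketched in userspace as below.
      All toy_* names are hypothetical and this is not the actual patch;
      it only shows why private data must not be interpreted before the
      file's operations table has been compared against the expected one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for file_operations, struct file and the device. */
struct toy_file_ops {
	const char *name;
};

struct toy_file {
	const struct toy_file_ops *f_op;	/* identifies the file's type */
	void *private_data;			/* meaning depends on f_op */
};

struct toy_fuse_dev {
	int id;
};

static const struct toy_file_ops toy_fuse_dev_ops = { "fuse_dev" };

/*
 * Only trust private_data if the file really is one of ours.  Without
 * the f_op comparison, any fd passed in by the user would have its
 * private_data treated as a struct toy_fuse_dev.
 */
static struct toy_fuse_dev *toy_get_dev_checked(struct toy_file *file)
{
	assert(file != NULL);
	if (file->f_op != &toy_fuse_dev_ops)
		return NULL;
	return file->private_data;
}
```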
      
      Signed-off-by: Jann Horn <jann@thejh.net>
      Acked-by: Miklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8ed1f0e2
    • Merge tag 'keystone-dts-late-fixes-v2' of... · 02149517
      Olof Johansson authored
      
      Merge tag 'keystone-dts-late-fixes-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux-keystone into fixes
      
      ARM: Couple of Keystone MDIO DTS fixes for 4.2-rc6+
      
      These are necessary to get the NIC card working on all Keystone
      EVMs. Couple of boards are broken without these two fixes.
      
      * tag 'keystone-dts-late-fixes-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux-keystone:
        ARM: dts: keystone: Fix the mdio bindings by moving it to soc specific file
        ARM: dts: keystone: fix the clock node for mdio
      
      Signed-off-by: Olof Johansson <olof@lixom.net>
      02149517
  5. Aug 16, 2015
    • MIPS: Fix seccomp syscall argument for MIPS64 · 9f161439
      Markos Chandras authored
      Commit 4c21b8fd ("MIPS: seccomp: Handle indirect system calls (o32)")
      fixed indirect system calls on O32 but it also introduced a bug for MIPS64
      where it erroneously modified the v0 (syscall) register with the assumption
      that the syscall offset hasn't been taken into consideration. This breaks
      seccomp on MIPS64 n64 and n32 ABIs. We fix this by replacing the addition
      with a move instruction.
      
      Fixes: 4c21b8fd ("MIPS: seccomp: Handle indirect system calls (o32)")
      Cc: <stable@vger.kernel.org> # 3.15+
      Reviewed-by: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/10951/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      9f161439
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · 1efdb5f0
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "This has two libfc fixes for bugs causing rare crashes, one iscsi fix
        for a potential hang on shutdown, and a fix for an I/O blocksize issue
        which caused a regression"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        sd: Fix maximum I/O size for BLOCK_PC requests
        libfc: Fix fc_fcp_cleanup_each_cmd()
        libfc: Fix fc_exch_recv_req() error path
        libiscsi: Fix host busy blocking during connection teardown
      1efdb5f0
  6. Aug 15, 2015
    • change sb_writers to use percpu_rw_semaphore · 8129ed29
      Oleg Nesterov authored
      
      
      We can remove everything from struct sb_writers except frozen
      and add the array of percpu_rw_semaphore's instead.
      
      This patch doesn't remove sb_writers->wait_unfrozen yet, we keep
      it for get_super_thawed(). We will probably remove it later.
      
      This change tries to address the following problems:
      
      	- Firstly, __sb_start_write() looks simply buggy. It does
      	  __sb_end_write() if it sees ->frozen, but if it migrates
      	  to another CPU before percpu_counter_dec(), sb_wait_write()
      	  can wrongly succeed if there is another task which holds
      	  the same "semaphore": sb_wait_write() can miss the result
      	  of the previous percpu_counter_inc() but see the result
      	  of this percpu_counter_dec().
      
      	- As Dave Hansen reports, it is suboptimal. The trivial
      	  microbenchmark that writes to a tmpfs file in a loop runs
      	  12% faster if we change this code to rely on RCU and kill
      	  the memory barriers.
      
      	- This code doesn't look simple. It would be better to rely
      	  on the generic locking code.
      
      	  According to Dave, this change adds the same performance
      	  improvement.
      
      Note: with this change both freeze_super() and thaw_super() will do
      synchronize_sched_expedited() 3 times. This is just ugly. But:
      
      	- This will be "fixed" by the rcu_sync changes we are going
      	  to merge. After that freeze_super()->percpu_down_write()
      	  will use synchronize_sched(), and thaw_super() won't use
      	  synchronize() at all.
      
      	  This doesn't need any changes in fs/super.c.
      
      	- Once we merge rcu_sync changes, we can also change super.c
      	  so that all wb_write->rw_sem's will share the single ->rss
      	  in struct sb_writes, then freeze_super() will need only one
      	  synchronize_sched().
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.com>
      8129ed29
    • shift percpu_counter_destroy() into destroy_super_work() · 853b39a7
      Oleg Nesterov authored
      
      
      Of course, this patch is ugly as hell. It will be (partially)
      reverted later. We add it to ensure that other WIP changes in
      percpu_rw_semaphore won't break fs/super.c.
      
      We do not even need this change right now, percpu_free_rwsem()
      is fine in atomic context. But we are going to change this, it
      will be might_sleep() after we merge the rcu_sync() patches.
      
      And even after that we do not really need destroy_super_work(),
      we will kill it in any case. Instead, destroy_super_rcu() should
      just check that rss->cb_state == CB_IDLE and do call_rcu() again
      in the (very unlikely) case this is not true.
      
      So this is just the temporary kludge which helps us to avoid the
      conflicts with the changes which will be (hopefully) routed via
      rcu tree.
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.com>
      853b39a7
    • percpu-rwsem: kill CONFIG_PERCPU_RWSEM · bf3eac84
      Oleg Nesterov authored
      
      
      Remove CONFIG_PERCPU_RWSEM, the next patch adds the unconditional
      user of percpu_rw_semaphore.
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      bf3eac84
    • percpu-rwsem: introduce percpu_rwsem_release() and percpu_rwsem_acquire() · 55cc1565
      Oleg Nesterov authored
      
      
      Add percpu_rwsem_release() and percpu_rwsem_acquire() for the users
      which need to return to userspace with percpu-rwsem lock held and/or
      pass the ownership to another thread.
      
      TODO: change percpu_rwsem_release() to use rwsem_clear_owner(). We can
      either fold kernel/locking/rwsem.h into include/linux/rwsem.h, or add
      the non-inline percpu_rwsem_clear_owner().
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      55cc1565
    • percpu-rwsem: introduce percpu_down_read_trylock() · 9287f692
      Oleg Nesterov authored
      
      
      Add percpu_down_read_trylock(), it will have the user soon.
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      9287f692
    • document rwsem_release() in sb_wait_write() · 0e28e01f
      Oleg Nesterov authored
      
      
      Not only do we need to avoid the warning from lockdep_sys_exit(), the
      caller of freeze_super() can never release this lock. Another thread
      can do this, so there is another reason for rwsem_release().
      
      Plus the comment should explain why we have to fool lockdep.
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.com>
      0e28e01f
    • fix the broken lockdep logic in __sb_start_write() · f4b554af
      Oleg Nesterov authored
      
      
      1. wait_event(frozen < level) without rwsem_acquire_read() is just
         wrong from lockdep perspective. If we are going to deadlock
         because the caller is buggy, lockdep can't detect this problem.
      
      2. __sb_start_write() can race with thaw_super() + freeze_super(),
         and after "goto retry" the 2nd  acquire_freeze_lock() is wrong.
      
      3. The "tell lockdep we are doing trylock" hack doesn't look nice.
      
         I think this is correct, but this logic should be more explicit.
         Yes, the recursive read_lock() is fine if we hold the lock on a
         higher level. But we do not need to fool lockdep. If we can not
         deadlock in this case then try-lock must not fail and we can use
         wait == F throughout this code.
      
      Note: as Dave Chinner explains, the "trylock" hack and the fat comment
      can be probably removed. But this needs a separate change and it will
      be trivial: just kill __sb_start_write() and rename do_sb_start_write()
      back to __sb_start_write().
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.com>
      f4b554af
    • introduce __sb_writers_{acquired,release}() helpers · bee9182d
      Oleg Nesterov authored
      
      
      Preparation to hide the sb->s_writers internals from xfs and btrfs.
      Add 2 trivial define's they can use rather than play with ->s_writers
      directly. No changes in btrfs/transaction.o and xfs/xfs_aops.o.
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.com>
      bee9182d
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 45e38cff
      Linus Torvalds authored
      Pull KVM fixes from Paolo Bonzini:
       "Just two very small & simple patches"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: x86: Use adjustment in guest cycles when handling MSR_IA32_TSC_ADJUST
        KVM: x86: zero IDT limit on entry to SMM
      45e38cff
    • Merge branch 'akpm' (patches from Andrew) · 8394a1b7
      Linus Torvalds authored
      Merge fixes from Andrew Morton:
       "11 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        Update maintainers for DRM STI driver
        mm: cma: mark cma_bitmap_maxno() inline in header
        zram: fix pool name truncation
        memory-hotplug: fix wrong edge when hot add a new node
        .mailmap: Andrey Ryabinin has moved
        ipc/sem.c: update/correct memory barriers
        mm/hwpoison: fix panic due to split huge zero page
        ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()
        ipc,sem: fix use after free on IPC_RMID after a task using same semaphore set exits
        mm/hwpoison: fix fail isolate hugetlbfs page w/ refcount held
        mm/hwpoison: fix page refcount of unknown non LRU page
      8394a1b7
    • Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux · fbd9163f
      Linus Torvalds authored
      Pull clock fix from Stephen Boyd:
       "A one-liner for a regression found in the PXA clock driver"
      
      * tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
        clk: pxa: pxa3xx: fix CKEN register access
      fbd9163f
    • Update maintainers for DRM STI driver · 7f11c476
      Benjamin Gaignard authored
      
      
      Add Vincent Abriou and myself as maintainers.
      
      Signed-off-by: Benjamin Gaignard <benjamin.gaignard@linaro.org>
      Cc: Vincent Abriou <vincent.abriou@st.com>
      Cc: Dave Airlie <airlied@linux.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f11c476
    • mm: cma: mark cma_bitmap_maxno() inline in header · f21838e0
      Gregory Fong authored
      
      
      cma_bitmap_maxno() was marked as static and not static inline, which can
      cause warnings about this function not being used if this file is included
      in a file that does not call that function, and violates the conventions
      used elsewhere.  The two options are to move the function implementation
      back to mm/cma.c or make it inline here, and it's simple enough for the
      latter to make sense.
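
      As a sketch of the difference: a plain "static" function defined in
      a header is emitted into every file that includes it, and compilers
      such as GCC warn (-Wunused-function) where it is never called, while
      "static inline" does not trigger that warning.  struct toy_cma below
      is a hypothetical, trimmed-down stand-in for struct cma, and the
      function body is my reading of what the helper computes.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical, trimmed-down stand-in for struct cma. */
struct toy_cma {
	unsigned long count;		/* number of pages in the area */
	unsigned int order_per_bit;	/* each bitmap bit covers 2^order pages */
};

/*
 * Number of bits needed in the allocation bitmap.  Marked static inline
 * so that a file including this definition without calling it stays
 * free of unused-function warnings.
 */
static inline unsigned long toy_cma_bitmap_maxno(const struct toy_cma *cma)
{
	assert(cma->order_per_bit < sizeof(unsigned long) * CHAR_BIT);
	return cma->count >> cma->order_per_bit;
}
```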
      
      Signed-off-by: Gregory Fong <gregory.0xf0@gmail.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f21838e0
    • zram: fix pool name truncation · 4ce321f5
      Sergey Senozhatsky authored
      
      
      zram_meta_alloc() constructs a pool name for zs_create_pool() call as
      
          snprintf(pool_name, sizeof(pool_name), "zram%d", device_id);
      
      However, it defines pool name buffer to be only 8 bytes long (minus
      trailing zero), which means that we can have only 1000 pool names: zram0
      -- zram999.
      
      With CONFIG_ZSMALLOC_STAT enabled an attempt to create a device zram1000
      can fail if device zram100 already exists, because snprintf() will
      truncate new pool name to zram100 and pass it to debugfs_create_dir(),
      causing:
      
        debugfs dir <zram100> creation failed
        zram: Error creating memory pool
      
      ... and so on.
      
      Fix it by passing zram->disk->disk_name to zram_meta_alloc() instead of
      device_id.  We construct zram%d name earlier and keep it as a ->disk_name,
      no need to snprintf() it again.
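
      The truncation is easy to reproduce in userspace.  The sketch below
      uses a hypothetical helper, make_pool_name(), that formats the name
      the same way and reports whether it fit; snprintf() returns the
      length the full string would have had, so a return value greater
      than or equal to the buffer size means the name was cut short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Format "zram<id>" into buf; return true only if nothing was truncated. */
static bool make_pool_name(char *buf, size_t size, int device_id)
{
	int n;

	assert(size > 0);
	n = snprintf(buf, size, "zram%d", device_id);
	return n >= 0 && (size_t)n < size;
}
```

      With an 8 byte buffer, device 999 fits as "zram999", but device
      1000 needs 9 bytes including the trailing zero and ends up as
      "zram100", colliding with the name of device 100.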
      
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4ce321f5
    • memory-hotplug: fix wrong edge when hot add a new node · f9126ab9
      Xishi Qiu authored
      
      
      When a new node is hot-added, the boundaries of its memory may be wrong.

      E.g. the system has 4 nodes, node3 is movable, and node3 mem:[24G-32G]:

      1. hot-remove node3,
      2. then hot-add node3 with only part of its memory, mem:[26G-30G],
      3. call hotadd_new_pgdat()
              free_area_init_node()
                      get_pfn_range_for_nid()
      4. it will return the wrong start_pfn and end_pfn, because memblock
      has not been updated yet.
      
      This patch also fixes a BUG_ON during hot-addition; see
      http://marc.info/?l=linux-kernel&m=142961156129456&w=2
      
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f9126ab9
    • Andrey Ryabinin's avatar
      .mailmap: Andrey Ryabinin has moved · 2baf9e89
      Andrey Ryabinin authored
      
      
      Update my email address.
      
      Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2baf9e89
    • Manfred Spraul's avatar
      ipc/sem.c: update/correct memory barriers · 3ed1f8a9
      Manfred Spraul authored
      
      
      sem_lock() did not properly pair memory barriers:
      
      !spin_is_locked() and spin_unlock_wait() are both only control barriers.
      The code needs an acquire barrier, otherwise the cpu might perform read
      operations before the lock test.
      
      As no suitable primitive exists inside <include/spinlock.h>, and since
      it seems no one wants another primitive, the code creates a local
      primitive within ipc/sem.c.
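
      A userspace analogy of the ordering problem, written with C11 atomics
      rather than kernel primitives (sketch_lock and
      observed_unlocked_acquire() are hypothetical names used only for
      illustration).  A control dependency orders later stores after the
      tested load, but it does not stop later loads from being performed
      early, so an explicit acquire barrier follows the lock test:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a spinlock word: nonzero means "held". */
typedef struct {
	atomic_int locked;
} sketch_lock;

/*
 * Return true if the lock was observed free.  On success, loads issued
 * after this call cannot be satisfied before the lock test.
 */
static bool observed_unlocked_acquire(sketch_lock *l)
{
	if (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
		return false;

	/* The branch alone is only a control barrier; add acquire ordering. */
	atomic_thread_fence(memory_order_acquire);
	return true;
}
```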
      
      With regard to -stable:
      
      The change of sem_wait_array() is a bugfix, the change to sem_lock() is a
      nop (just a preprocessor redefinition to improve the readability).  The
      bugfix is necessary for all kernels that use sem_wait_array() (i.e.:
      starting from 3.10).
      
      Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Kirill Tkhai <ktkhai@parallels.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: <stable@vger.kernel.org>	[3.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ed1f8a9
    • Wanpeng Li's avatar
      mm/hwpoison: fix panic due to split huge zero page · 7f6bf39b
      Wanpeng Li authored
      
      
      Bug:
      
        ------------[ cut here ]------------
        kernel BUG at mm/huge_memory.c:1957!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 snd_hda_codec_realtek snd_hda_codec_generic nfsv4 dns_re
        CPU: 2 PID: 2576 Comm: test_huge Not tainted 4.2.0-rc5-mm1+ #27
        Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
        task: ffff880204e3d600 ti: ffff8800db16c000 task.ti: ffff8800db16c000
        RIP: split_huge_page_to_list+0xdb/0x120
        Call Trace:
          memory_failure+0x32e/0x7c0
          madvise_hwpoison+0x8b/0x160
          SyS_madvise+0x40/0x240
          ? do_page_fault+0x37/0x90
          entry_SYSCALL_64_fastpath+0x12/0x71
        Code: ff f0 41 ff 4c 24 30 74 0d 31 c0 48 83 c4 08 5b 41 5c 41 5d c9 c3 4c 89 e7 e8 e2 58 fd ff 48 83 c4 08 31 c0
        RIP  split_huge_page_to_list+0xdb/0x120
         RSP <ffff8800db16fde8>
        ---[ end trace aee7ce0df8e44076 ]---
      
      Testcase:
      
          #define _GNU_SOURCE
          #include <stdlib.h>
          #include <stdio.h>
          #include <sys/mman.h>
          #include <unistd.h>
          #include <fcntl.h>
          #include <sys/types.h>
          #include <errno.h>
          #include <string.h>
      
          /* Parenthesized so the macro expands safely inside any expression. */
          #define MB (1024 * 1024)
      
          int main(void)
          {
                  char *mem;
      
                  posix_memalign((void **)&mem, 2 * MB, 200 * MB);
      
                  madvise(mem, 200 * MB, MADV_HWPOISON);
      
                  free(mem);
      
                  return 0;
          }
      
      A huge zero page is allocated when a page fault occurs without the
      FAULT_FLAG_WRITE flag.  get_user_pages_fast(), which is called in
      madvise_hwpoison(), will get the huge zero page if the page has not
      been allocated before.  The huge zero page is a transparent huge page;
      however, it is not an anonymous page.  memory_failure() will split the
      huge zero page and trigger BUG_ON(is_huge_zero_page(page));
      
      After commit 98ed2b00 ("mm/memory-failure: give up error handling
      for non-tail-refcounted thp"), memory_failure() no longer catches
      non-anonymous THP coming from the madvise_hwpoison path, so this bug
      occurs.

      Fix it by catching non-anonymous THP in memory_failure() so that the
      huge zero page is not split in the madvise_hwpoison path.
      
      After this patch:
      
        Injecting memory failure for page 0x202800 at 0x7fd8ae800000
        MCE: 0x202800: non anonymous thp
        [...]
      
      [akpm@linux-foundation.org: remove second split, per Wanpeng]
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f6bf39b