  1. Aug 01, 2012
  2. Jun 27, 2012
    • blkcg: implement per-blkg request allocation · a051661c
      Tejun Heo authored
      Currently, request_queue has one request_list to allocate requests
      from regardless of blkcg of the IO being issued.  When the unified
      request pool is used up, cfq proportional IO limits become meaningless
      - whoever grabs the next request being freed wins the race regardless
      of the configured weights.
      
      This can be easily demonstrated by creating a blkio cgroup w/ very low
      weight, putting a program which can issue a lot of random direct IOs
      there, and running a sequential IO from a different cgroup.  As soon as
      the request pool is used up, the sequential IO bandwidth crashes.
      
      This patch implements per-blkg request_list.  Each blkg has its own
      request_list and any IO allocates its request from the matching blkg
      making blkcgs completely isolated in terms of request allocation.
      
      * Root blkcg uses the request_list embedded in each request_queue,
        which was renamed to @q->root_rl from @q->rq.  While making blkcg rl
        handling a bit hairier, this enables avoiding most overhead for root
        blkcg.
      
      * Queue fullness is properly per request_list but bdi isn't blkcg
        aware yet, so congestion state currently just follows the root
        blkcg.  As writeback isn't aware of blkcg yet, this works okay for
        async congestion but readahead may get the wrong signals.  It's
        better than blkcg completely collapsing with shared request_list but
        needs to be improved with future changes.
      
      * After this change, each block cgroup gets a full request pool making
        resource consumption of each cgroup higher.  This makes allowing
        non-root users to create cgroups less desirable; however, note that
        allowing non-root users to directly manage cgroups is already
        severely broken regardless of this patch - each block cgroup
        consumes kernel memory and skews IO weight (IO weights are not
        hierarchical).
      
      v2: queue-sysfs.txt updated and patch description updated as suggested
          by Vivek.
      
      v3: blk_get_rl() wasn't checking error return from
          blkg_lookup_create() and may cause oops on lookup failure.  Fix it
          by falling back to root_rl on blkg lookup failures.  This problem
          was spotted by Rakesh Iyer <rni@google.com>.
      
      v4: Updated to accommodate 458f27a9 "block: Avoid missed wakeup in
          request waitqueue".  blk_drain_queue() now wakes up waiters on all
          blkg->rl on the target queue.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a051661c
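The per-blkg selection described in the commit above, including the v3 fallback to root_rl when the blkg lookup fails, can be pictured with a small userspace sketch. This is not the kernel code: the struct layouts, pick_rl() and rl_try_alloc() are hypothetical stand-ins chosen only to show that exhausting one group's pool leaves the root pool untouched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct request_list {
    int count;  /* requests currently allocated from this list */
    int limit;  /* size of this list's pool */
};

struct blkg {
    struct request_list rl;       /* each group owns its own list */
};

struct request_queue {
    struct request_list root_rl;  /* list embedded in the queue, used by root */
};

/*
 * Pick the request_list for an IO.  A NULL blkg models a failed lookup,
 * in which case we fall back to the queue's root_rl (the v3 fix).
 */
static struct request_list *pick_rl(struct request_queue *q, struct blkg *blkg)
{
    return blkg ? &blkg->rl : &q->root_rl;
}

/* Take one request from rl; returns 1 on success, 0 if the pool is used up. */
static int rl_try_alloc(struct request_list *rl)
{
    assert(rl != NULL);
    if (rl->count >= rl->limit)
        return 0;
    rl->count++;
    return 1;
}
```

In this model a low-weight group that drains its own list only makes its own later allocations fail; allocations routed to root_rl still succeed.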
  3. Jun 25, 2012
    • block: prepare for multiple request_lists · 5b788ce3
      Tejun Heo authored
      
      
      Request allocation is about to be made per-blkg meaning that there'll
      be multiple request lists.
      
      * Make queue full state per request_list.  blk_*queue_full() functions
        are renamed to blk_*rl_full() and take @rl instead of @q.
      
      * Rename blk_init_free_list() to blk_init_rl() and make it take @rl
        instead of @q.  Also add @gfp_mask parameter.
      
      * Add blk_exit_rl() instead of destroying rl directly from
        blk_release_queue().
      
      * Add request_list->q and make request alloc/free functions -
        blk_free_request(), [__]freed_request(), __get_request() - take @rl
        instead of @q.
      
      This patch doesn't introduce any functional difference.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5b788ce3
    • block: add q->nr_rqs[] and move q->rq.elvpriv to q->nr_rqs_elvpriv · 8a5ecdd4
      Tejun Heo authored
      
      
      Add q->nr_rqs[] which currently behaves the same as q->rq.count[] and
      move q->rq.elvpriv to q->nr_rqs_elvpriv.  blk_drain_queue() is updated
      to use q->nr_rqs[] instead of q->rq.count[].
      
      These counters separate queue-wide request statistics from the
      request list and allow implementation of per-queue request allocation.
      
      While at it, properly indent fields of struct request_list.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8a5ecdd4
    • blkcg: inline bio_blkcg() and friends · b1208b56
      Tejun Heo authored
      
      
      Make bio_blkcg() and friends inline.  They are all very simple and
      used in only a few places.
      
      This patch is to prepare for further updates to request allocation
      path.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b1208b56
    • block: allocate io_context upfront · 7f4b35d1
      Tejun Heo authored
      
      
      Block layer does very lazy allocation of ioc.  It waits until the moment
      ioc is absolutely necessary; unfortunately, that time could be inside
      queue lock and __get_request() performs unlock - try alloc - retry
      dancing.
      
      Just allocate it up-front on entry to block layer.  We're not saving
      the rain forest by deferring it to the last possible moment and
      complicating things unnecessarily.
      
      This patch is to prepare for further updates to request allocation
      path.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7f4b35d1
    • block: refactor get_request[_wait]() · a06e05e6
      Tejun Heo authored
      
      
      Currently, there are two request allocation functions - get_request()
      and get_request_wait().  The former tries to allocate a request once;
      the latter wraps the former and keeps retrying until allocation
      succeeds.
      
      The combination of the two functions delivers fallible non-wait
      allocation, fallible wait allocation and unfailing wait allocation.
      However, given that forward progress is guaranteed, fallible wait
      allocation isn't all that useful and in fact nobody uses it.
      
      This patch simplifies the interface as follows.
      
      * get_request() is renamed to __get_request() and is only used by the
        wrapper function.
      
      * get_request_wait() is renamed to get_request().  It now takes
        @gfp_mask and retries iff it contains %__GFP_WAIT.
      
      This patch doesn't introduce any functional change and is to prepare
      for further updates to request allocation path.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a06e05e6
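The interface described in the commit above (a single-attempt helper wrapped by a function that retries only when the caller's mask allows waiting) can be sketched in userspace as follows. Everything here is an illustrative assumption: SKETCH_GFP_WAIT stands in for %__GFP_WAIT, struct pool models the request pool, and wait_for_free() replaces sleeping on the wait queue.

```c
#include <assert.h>

#define SKETCH_GFP_WAIT 0x1u  /* hypothetical stand-in for %__GFP_WAIT */

/* Toy model of a request pool. */
struct pool {
    int free;       /* requests available right now */
    int in_flight;  /* requests that complete if the caller waits */
};

/* Single allocation attempt; plays the role of __get_request(). */
static int try_get_request(struct pool *p)
{
    if (p->free == 0)
        return 0;
    p->free--;
    return 1;
}

/* Stand-in for sleeping until a request is freed. */
static void wait_for_free(struct pool *p)
{
    assert(p->in_flight > 0);  /* forward progress is assumed, as in the text */
    p->in_flight--;
    p->free++;
}

/*
 * Plays the role of the renamed get_request(): retry iff the mask
 * contains the wait flag, otherwise fail after one attempt.
 */
static int get_request_sketch(struct pool *p, unsigned int gfp_mask)
{
    for (;;) {
        if (try_get_request(p))
            return 1;
        if (!(gfp_mask & SKETCH_GFP_WAIT))
            return 0;
        wait_for_free(p);
    }
}
```

With an empty pool, a call without the wait flag fails immediately, while a call with it blocks (here: consumes one in-flight completion) and then succeeds.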
    • block: drop custom queue draining used by scsi_transport_{iscsi|fc} · 86072d81
      Tejun Heo authored
      
      
      iscsi_remove_host() uses bsg_remove_queue() which implements custom
      queue draining.  fc_bsg_remove() open-codes mostly identical logic.
      
      The draining logic isn't correct in that blk_stop_queue() doesn't
      prevent new requests from being queued - it just stops processing, so
      nothing prevents new requests from being queued after the logic
      determines that the queue is drained.
      
      blk_cleanup_queue() now implements proper queue draining and these
      custom draining logics aren't necessary.  Drop them and use
      bsg_unregister_queue() + blk_cleanup_queue() instead.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: James Smart <james.smart@emulex.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      86072d81
    • mempool: add @gfp_mask to mempool_create_node() · a91a5ac6
      Tejun Heo authored
      
      
      mempool_create_node() currently assumes %GFP_KERNEL.  Its only user,
      blk_init_free_list(), is about to be updated to use other allocation
      flags - add @gfp_mask argument to the function.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a91a5ac6
    • blkcg: make root blkcg allocation use %GFP_KERNEL · 15974993
      Tejun Heo authored
      
      
      Currently, blkcg_activate_policy() depends on %GFP_ATOMIC allocation
      from __blkg_lookup_create() for root blkcg creation.  This could make
      policy activation fail unnecessarily.
      
      Make blkg_alloc() take @gfp_mask, __blkg_lookup_create() take an
      optional @new_blkg for preallocated blkg, and blkcg_activate_policy()
      preload radix tree and preallocate blkg with %GFP_KERNEL before trying
      to create the root blkg.
      
      v2: __blkg_lookup_create() was returning %NULL on blkg alloc failure
         instead of ERR_PTR() value.  Fixed.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      15974993
    • blkcg: __blkg_lookup_create() doesn't need radix preload · 13589864
      Tejun Heo authored
      
      
      There's no point in calling radix_tree_preload() if preloading doesn't
      use a more permissive GFP mask.  Drop preloading from
      __blkg_lookup_create().
      
      While at it, drop sparse locking annotation which no longer applies.
      
      v2: Vivek pointed out the odd preload usage.  Instead of updating,
          just drop it.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      13589864
  4. Jun 15, 2012
    • scsi: Silence unnecessary warnings about ioctl to partition · 6d935928
      Jan Kara authored
      
      
      Sometimes, warnings about ioctls to partition happen often enough that they
      form the majority of the warnings in the kernel log and users complain. In some
      cases warnings are about ioctls such as SG_IO so it's not good to get rid of
      the warnings completely as they can ease debugging of userspace problems
      when ioctl is refused.
      
      Since I have seen warnings from lots of commands, including some proprietary
      userspace applications, I don't think disallowing the ioctls for processes
      with CAP_SYS_RAWIO will happen in the near future if ever. So let's just
      stop warning for processes with CAP_SYS_RAWIO for which ioctl is allowed.
      
      CC: Paolo Bonzini <pbonzini@redhat.com>
      CC: James Bottomley <JBottomley@parallels.com>
      CC: linux-scsi@vger.kernel.org
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6d935928
    • block: Drop dead function blk_abort_queue() · 76aaa510
      Asias He authored
      This function was only used by btrfs code in btrfs_abort_devices()
      (seems in a wrong way).
      
      It was removed in commit d07eb911, so let's remove the dead code to
      avoid any confusion.
      
      Changes in v2: update commit log, btrfs_abort_devices() was removed
      already.
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-kernel@vger.kernel.org
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: linux-btrfs@vger.kernel.org
      Cc: David Sterba <dave@jikos.cz>
      Signed-off-by: Asias He <asias@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      76aaa510
    • block: Mitigate lock unbalance caused by lock switching · 5e5cfac0
      Asias He authored
      Commit 777eb1bf disconnects externally
      supplied queue_lock before blk_drain_queue(). Switching the lock would
      introduce lock unbalance because threads which have taken the external
      lock might unlock the internal lock during the queue drain. This
      patch mitigates this by disconnecting the lock after the queue draining
      since queue draining makes a lot of request_queue users go away.
      
      However, please note, this patch only makes the problem less likely to
      happen. Anyone who still holds a ref might try to issue a new request on
      a dead queue after blk_cleanup_queue() finishes draining; the lock
      unbalance might still happen in this case.
      
       =====================================
       [ BUG: bad unlock balance detected! ]
       3.4.0+ #288 Not tainted
       -------------------------------------
       fio/17706 is trying to release lock (&(&q->__queue_lock)->rlock) at:
       [<ffffffff81329372>] blk_queue_bio+0x2a2/0x380
       but there are no more locks to release!
      
       other info that might help us debug this:
       1 lock held by fio/17706:
        #0:  (&(&vblk->lock)->rlock){......}, at: [<ffffffff81327f1a>]
       get_request_wait+0x19a/0x250
      
       stack backtrace:
       Pid: 17706, comm: fio Not tainted 3.4.0+ #288
       Call Trace:
        [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
        [<ffffffff810dea49>] print_unlock_inbalance_bug+0xf9/0x100
        [<ffffffff810dfe4f>] lock_release_non_nested+0x1df/0x330
        [<ffffffff811dae24>] ? dio_bio_end_aio+0x34/0xc0
        [<ffffffff811d6935>] ? bio_check_pages_dirty+0x85/0xe0
        [<ffffffff811daea1>] ? dio_bio_end_aio+0xb1/0xc0
        [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
        [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
        [<ffffffff810e0079>] lock_release+0xd9/0x250
        [<ffffffff81a74553>] _raw_spin_unlock_irq+0x23/0x40
        [<ffffffff81329372>] blk_queue_bio+0x2a2/0x380
        [<ffffffff81328faa>] generic_make_request+0xca/0x100
        [<ffffffff81329056>] submit_bio+0x76/0xf0
        [<ffffffff8115470c>] ? set_page_dirty_lock+0x3c/0x60
        [<ffffffff811d69e1>] ? bio_set_pages_dirty+0x51/0x70
        [<ffffffff811dd1a8>] do_blockdev_direct_IO+0xbf8/0xee0
        [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
        [<ffffffff811dd4e5>] __blockdev_direct_IO+0x55/0x60
        [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
        [<ffffffff811d92e7>] blkdev_direct_IO+0x57/0x60
        [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
        [<ffffffff8114c6ae>] generic_file_aio_read+0x70e/0x760
        [<ffffffff810df7c5>] ? __lock_acquire+0x215/0x5a0
        [<ffffffff811e9924>] ? aio_run_iocb+0x54/0x1a0
        [<ffffffff8114bfa0>] ? grab_cache_page_nowait+0xc0/0xc0
        [<ffffffff811e82cc>] aio_rw_vect_retry+0x7c/0x1e0
        [<ffffffff811e8250>] ? aio_fsync+0x30/0x30
        [<ffffffff811e9936>] aio_run_iocb+0x66/0x1a0
        [<ffffffff811ea9b0>] do_io_submit+0x6f0/0xb80
        [<ffffffff8134de2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
        [<ffffffff811eae50>] sys_io_submit+0x10/0x20
        [<ffffffff81a7c9e9>] system_call_fastpath+0x16/0x1b
      
      Changes since v2: Update commit log to explain how the code is still
                        broken even if we delay the lock switching after the drain.
      Changes since v1: Update commit log as Tejun suggested.
      
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Asias He <asias@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5e5cfac0
    • block: Avoid missed wakeup in request waitqueue · 458f27a9
      Asias He authored
      
      
      After hot-unplugging a stressed disk, I found that rl->wait[] is not empty
      while rl->count[] is empty and there are threads still sleeping on
      get_request after the queue cleanup. With simple debug code, I found
      there are exactly nr_sleep - nr_wakeup threads in D state. So there
      are missed wakeups.
      
        $ dmesg | grep nr_sleep
        [   52.917115] ---> nr_sleep=1046, nr_wakeup=873, delta=173
        $ vmstat 1
        1 173  0 712640  24292  96172 0 0  0  0  419  757  0  0  0 100  0
      
      To quote Tejun:
      
        Ah, okay, freed_request() wakes up single waiter with the assumption
        that after the wakeup there will at least be one successful allocation
        which in turn will continue the wakeup chain until the wait list is
        empty - ie. waiter wakeup is dependent on successful request
        allocation happening after each wakeup.  With queue marked dead, any
        woken up waiter fails the allocation path, so the wakeup chaining is
        lost and we're left with hung waiters. What we need is wake_up_all()
        after drain completion.
      
      This patch fixes the missed wakeup by waking up all the threads which
      are sleeping on the wait queue after queue drain.
      
      Changes in v2: Drop waitqueue_active() optimization
      
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Asias He <asias@redhat.com>
      
      Fixed a bug by me, where stacked devices would oops on calling
      blk_drain_queue() since ->rq.wait[] do not get initialized unless
      it's a full queue setup.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      458f27a9
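The wakeup chain that Tejun describes in the commit above can be modelled with a toy counter. This is only a sketch under simplifying assumptions: struct waitq, wake_chain() and wake_all_sketch() are hypothetical names, and "a woken waiter allocates, later frees, and thereby wakes the next waiter" is collapsed into one loop iteration.

```c
#include <assert.h>

/* Toy wait queue: only tracks how many waiters are still asleep. */
struct waitq {
    int sleeping;
};

/*
 * Model of single-waiter wakeup chaining.  Each iteration wakes one
 * waiter.  On a live queue the woken waiter allocates a request, and
 * freeing it later wakes the next waiter, so the loop continues.  On a
 * dead queue the woken waiter fails allocation, nothing is freed, and
 * the chain stops.  Returns the number of waiters left asleep.
 */
static int wake_chain(struct waitq *wq, int queue_dead)
{
    while (wq->sleeping > 0) {
        wq->sleeping--;   /* one waiter is woken */
        if (queue_dead)
            break;        /* allocation fails, chain is lost */
    }
    return wq->sleeping;
}

/* Model of the fix: wake every waiter once draining has completed. */
static int wake_all_sketch(struct waitq *wq)
{
    assert(wq->sleeping >= 0);
    wq->sleeping = 0;
    return wq->sleeping;
}
```

Starting from the 173 sleepers seen in the log above, the chained wakeup on a dead queue leaves 172 threads hung in this model, whereas waking all of them leaves none.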
  5. Jun 14, 2012
  6. Jun 12, 2012
    • drbd: fix null pointer dereference with on-congestion policy when diskless · 0d5934e3
      Lars Ellenberg authored
      
      
      We must not look at mdev->actlog, unless we have a get_ldev() reference.
      It also does not make much sense to try to disconnect or pull-ahead of
      the peer, if we don't have good local data.
      
      Only even consider congestion policies, if our local disk is D_UP_TO_DATE.
      
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      0d5934e3
    • drbd: fix list corruption by failing but already aborted reads · 1ed25b26
      Lars Ellenberg authored
      
      
      If a read is aborted due to force-detach of a supposedly unresponsive
      local backing device, and retried on the peer, it can happen that the
      local request later still completes (hopefully with an error).
      As it may already have been completed to upper layers meanwhile,
      it must not be retried again now.
      
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      1ed25b26
    • drbd: fix access of unallocated pages and kernel panic · 4eccc579
      Lars Ellenberg authored
      
      
      BUG: unable to handle kernel NULL pointer dereference at (null)
      ...
       [<d1e17561>] ? _drbd_bm_set_bits+0x151/0x240 [drbd]
       [<d1e236f8>] ? receive_bitmap+0x4f8/0xbc0 [drbd]
      
      This fixes an off-by-one error in the receive_bitmap() path,
      if run-length encoded bitmap transfer is enabled.
      
      If the bitmap is an exact multiple of PAGE_SIZE, which means the visible
      capacity of the drbd device is an exact multiple of 128 MiB (for 4k page
      size), and bitmap compression (use-rle) is enabled (which became default
      with 8.4), and the very last bit is dirty and reported in an rle
      compressed bitmap packet, we ended up trying to kmap_atomic a page pointer
      that does not exist (bitmap->bm_pages[last index + 1]).
      
      bug introduced by:
          Date:   Fri Jul 24 15:33:24 2009 +0200
          set bits: optimize for complete last word, fix off-by-one-word corner case
      
      made effective by:
          Date:   Thu Dec 16 00:32:38 2010 +0100
          drbd: get rid of unused debug code
      
          Long time ago, we had paranoia code in the bitmap that allocated one
          extra word, assigned a magic value, and checked on every occasion that
          the magic value was still unchanged.
      
          That debug code is unused, the extra long word complicates code a bit.
          Get rid of it.
      
      No-one triggered this bug in the last few years, because a large subset
      of our userbase is unaffected:
       * typically the last few blocks of a device are not modified
         frequently, and remain unset
       * use-rle was disabled by default in drbd < 8.4
       * those with slightly "odd" device sizes, or
       * drbd internal meta data (which will skew the device size slightly,
         thus makes it harder to have a bug relevant device size)
      
      Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      4eccc579
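The size condition in the commit above can be checked with a little arithmetic. The sketch below is not the drbd code; the helper names are hypothetical, and the 4 KiB covered by each bitmap bit is an assumption inferred from the commit's own figures (one 4k bitmap page corresponding to 128 MiB of device capacity).

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES     4096u
#define BITS_PER_PAGE  (PAGE_BYTES * 8u)  /* 32768 bits in one bitmap page */
#define BYTES_PER_BIT  4096u              /* assumed: inferred from the 128 MiB figure */

/* Device capacity tracked by one full bitmap page. */
static uint64_t bytes_covered_per_page(void)
{
    return (uint64_t)BITS_PER_PAGE * BYTES_PER_BIT;
}

/* Number of bitmap pages needed for total_bits bits (rounded up). */
static uint64_t pages_needed(uint64_t total_bits)
{
    assert(total_bits > 0);
    return (total_bits + BITS_PER_PAGE - 1) / BITS_PER_PAGE;
}

/* Index of the bitmap page that holds a given bit number. */
static uint64_t page_index_of_bit(uint64_t bit)
{
    return bit / BITS_PER_PAGE;
}
```

When the bit count is an exact multiple of BITS_PER_PAGE, the page index computed for "one past the last bit" equals the page count, which is one slot beyond the array. For any other size the same off-by-one still lands inside the last page, which is consistent with the commit's remark that slightly odd device sizes hide the bug.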
    • xen/blkfront: Add WARN to deal with misbehaving backends. · 6878c32e
      Konrad Rzeszutek Wilk authored
      
      
      Part of the ring structure is the 'id' field which is under
      control of the frontend. The frontend stamps it with "some"
      value (this some in this implementation being a value less
      than BLK_RING_SIZE), and when it gets a response expects
      said value to be in the response structure. We have a check
      for the id field when spooling new requests but not when
      de-spooling responses.
      
      We also add an extra check in add_id_to_freelist to make
      sure that the 'struct request' was not NULL - as we cannot
      pass a NULL to __blk_end_request_all, otherwise that crashes
      (and all the operations that the response is dealing with
      end up with __blk_end_request_all).
      
      Lastly we also print the name of the operation that failed.
      
      [v1: s/BUG/WARN/ suggested by Stefano]
      [v2: Add extra check in add_id_to_freelist]
      [v3: Redid op_name per Jan's suggestion]
      [v4: add const * and add WARN on failure returns]
      Acked-by: Jan Beulich <jbeulich@suse.com>
      Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      6878c32e
  7. Jun 06, 2012
  8. Jun 05, 2012
  9. Jun 04, 2012
    • blkcg: fix blkg_alloc() failure path · 9b2ea86b
      Tejun Heo authored
      
      
      When policy data allocation fails in the middle, blkg_alloc() invokes
      blkg_free() to destroy the half constructed blkg.  This ends up
      calling pd_exit_fn() on policy datas which didn't go through
      pd_init_fn().  Fix it by making blkg_alloc() call pd_init_fn()
      immediately after each policy data allocation.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9b2ea86b
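The fix described in the commit above (initialize each piece of policy data right after allocating it, so the shared free path never runs the exit hook on uninitialized data) follows a common allocation pattern. The userspace sketch below illustrates that pattern only; struct grp, struct pd, the fail_at parameter and the exit_on_uninit counter are all hypothetical and do not mirror the kernel's blkg code.

```c
#include <assert.h>
#include <stdlib.h>

#define NPOL 3  /* hypothetical number of policies */

struct pd  { int initialized; };
struct grp { struct pd *pd[NPOL]; };

/* Counts exit calls made on data that never went through init. */
static int exit_on_uninit;

static void pd_init(struct pd *pd) { pd->initialized = 1; }

static void pd_exit(struct pd *pd)
{
    if (!pd->initialized)
        exit_on_uninit++;
}

/* Shared free path: runs the exit hook on every allocated policy data. */
static void grp_free(struct grp *g)
{
    for (int i = 0; i < NPOL; i++) {
        if (g->pd[i]) {
            pd_exit(g->pd[i]);
            free(g->pd[i]);
        }
    }
    free(g);
}

/* fail_at forces the allocation at that index to fail; pass NPOL for none. */
static struct grp *grp_alloc(int fail_at)
{
    struct grp *g = calloc(1, sizeof(*g));

    if (!g)
        return NULL;
    for (int i = 0; i < NPOL; i++) {
        struct pd *pd = (i == fail_at) ? NULL : calloc(1, sizeof(*pd));

        if (!pd) {
            grp_free(g);  /* only sees fully initialized entries */
            return NULL;
        }
        g->pd[i] = pd;
        pd_init(pd);      /* init immediately after each allocation */
    }
    return g;
}
```

If pd_init() were instead deferred to a second loop after all allocations, a mid-way failure would hand allocated but uninitialized entries to grp_free(), which is the situation the commit describes.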
    • block: blkcg_policy_cfq shouldn't be used if !CONFIG_CFQ_GROUP_IOSCHED · ffea73fc
      Tejun Heo authored
      
      
      cfq may be built w/ or w/o blkcg support depending on
      CONFIG_CFQ_GROUP_IOSCHED.  If blkcg support is disabled, most of
      related code is ifdef'd out but some part is left dangling -
      blkcg_policy_cfq is left zero-filled and blkcg_policy_[un]register()
      calls are made on it.
      
      Feeding zero filled policy to blkcg_policy_register() is incorrect and
      triggers the following WARN_ON() if CONFIG_BLK_CGROUP &&
      !CONFIG_CFQ_GROUP_IOSCHED.
      
       ------------[ cut here ]------------
       WARNING: at block/blk-cgroup.c:867
       Modules linked in:
       Modules linked in:
       CPU: 3 Not tainted 3.4.0-09547-gfb21aff #1
       Process swapper/0 (pid: 1, task: 000000003ff80000, ksp: 000000003ff7f8b8)
       Krnl PSW : 0704100180000000 00000000003d76ca (blkcg_policy_register+0xca/0xe0)
      	    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
       Krnl GPRS: 0000000000000000 00000000014b85ec 00000000014b85b0 0000000000000000
      	    000000000096fb60 0000000000000000 00000000009a8e78 0000000000000048
      	    000000000099c070 0000000000b6f000 0000000000000000 000000000099c0b8
      	    00000000014b85b0 0000000000667580 000000003ff7fd98 000000003ff7fd70
       Krnl Code: 00000000003d76be: a7280001           lhi     %r2,1
      	    00000000003d76c2: a7f4ffdf           brc     15,3d7680
      	   #00000000003d76c6: a7f40001           brc     15,3d76c8
      	   >00000000003d76ca: a7c8ffea           lhi     %r12,-22
      	    00000000003d76ce: a7f4ffce           brc     15,3d766a
      	    00000000003d76d2: a7f40001           brc     15,3d76d4
      	    00000000003d76d6: a7c80000           lhi     %r12,0
      	    00000000003d76da: a7f4ffc2           brc     15,3d765e
       Call Trace:
       ([<0000000000b6f000>] initcall_debug+0x0/0x4)
        [<0000000000989e8a>] cfq_init+0x62/0xd4
        [<00000000001000ba>] do_one_initcall+0x3a/0x170
        [<000000000096fb60>] kernel_init+0x214/0x2bc
        [<0000000000623202>] kernel_thread_starter+0x6/0xc
        [<00000000006231fc>] kernel_thread_starter+0x0/0xc
       no locks held by swapper/0/1.
       Last Breaking-Event-Address:
        [<00000000003d76c6>] blkcg_policy_register+0xc6/0xe0
       ---[ end trace b8ef4903fcbf9dd3 ]---
      
      This patch fixes the problem by ensuring all blkcg support code is
      inside CONFIG_CFQ_GROUP_IOSCHED.
      
      * blkcg_policy_cfq declaration and blkg_to_cfqg() definition are moved
        inside the first CONFIG_CFQ_GROUP_IOSCHED block.  __maybe_unused is
        dropped from blkcg_policy_cfq decl.
      
      * blkcg_deactivate_policy() invocation is moved inside ifdef.  This
        also makes the activation logic match cfq_init_queue().
      
      * All blkcg_policy_[un]register() invocations are moved inside ifdef.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      LKML-Reference: <20120601112954.GC3535@osiris.boeblingen.de.ibm.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ffea73fc
    • block: fix return value on cfq_init() failure · fd794956
      Tejun Heo authored
      
      
      cfq_init() would return zero after kmem cache creation failure.  Fix
      so that it returns -ENOMEM.
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fd794956
    • mtip32xx: Remove version.h header file inclusion · 87c9ea76
      Sachin Kamat authored
      
      
      version.h header file inclusion is no longer required.
      
      Signed-off-by: Sachin Kamat <sachin.kamat@linaro.org>
      87c9ea76
    • gpio/samsung: fix the typo 'exynos5_xxx' instead of 'exonys5_xxx' · 5041caa4
      Kukjin Kim authored
      Should be 'exynos5_xxx' instead of 'exonys5_xxx'.
      
      It happened at the commit 30b84288 ("Merge tag 'soc2' of
      git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc")
      during v3.5 merge window.
      
      Signed-off-by: Kukjin Kim <kgene.kim@samsung.com>
      [ My bad  - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5041caa4
    • Merge branch 'pm-acpi' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 4d578573
      Linus Torvalds authored
      Pull some left-over PM patches from Rafael J. Wysocki.
      
      * 'pm-acpi' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPI / PM: Make acpi_pm_device_sleep_state() follow the specification
        ACPI / PM: Make __acpi_bus_get_power() cover D3cold correctly
        ACPI / PM: Fix error messages in drivers/acpi/bus.c
        rtc-cmos / PM: report wakeup event on ACPI RTC alarm
        ACPI / PM: Generate wakeup events on fixed power button
      4d578573
    • Revert "mm: compaction: handle incorrect MIGRATE_UNMOVABLE type pageblocks" · 68e3e926
      Linus Torvalds authored
      This reverts commit 5ceb9ce6.
      
      That commit seems to be the cause of the mm compaction list corruption
      issues that Dave Jones reported.  The locking (or rather, absence
      thereof) is dubious, as is the use of the 'page' variable once it has
      been found to be outside the pageblock range.
      
      So revert it for now, we can re-visit this for 3.6.  If we even need to:
      as Minchan Kim says, "The patch wasn't a bug fix and even test workload
      was very theoretical".
      
      Reported-and-tested-by: Dave Jones <davej@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      68e3e926
    • mm: fix warning in __set_page_dirty_nobuffers · 752dc185
      Hugh Dickins authored
      
      
      New tmpfs use of !PageUptodate pages for fallocate() is triggering the
      WARNING: at mm/page-writeback.c:1990 when __set_page_dirty_nobuffers()
      is called from migrate_page_copy() for compaction.
      
      It is anomalous that migration should use __set_page_dirty_nobuffers()
      on an address_space that does not participate in dirty and writeback
      accounting; and this has also been observed to insert surprising dirty
      tags into a tmpfs radix_tree, despite tmpfs not using tags at all.
      
      We should probably give migrate_page_copy() a better way to preserve the
      tag and migrate accounting info, when mapping_cap_account_dirty().  But
      that needs some more work: so in the interim, avoid the warning by using
      a simple SetPageDirty on PageSwapBacked pages.
      
      Reported-and-tested-by: Dave Jones <davej@redhat.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      752dc185
    • vfs: move inode stat information closer together · 2f9d3df8
      Linus Torvalds authored
      
      
      The comment above it says "Stat data, not accessed from path walking",
      but in fact some of the inode fields we use for the common stat data
      were way down at the end of the inode, causing unnecessary cache misses
      for the common stat operations.
      
      The inode structure is pretty big, and this can change padding depending
      on field width, but at least on the common 64-bit configurations this
      doesn't change the size.  Some of our inode layout has historically been
      designed to avoid unnecessary padding fields, but cache locality is at
      least as important for layout, if not more.
      
      Noticed by looking at kernel profiles, and noticing that the "i_blkbits"
      access stood out like a sore thumb.
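
To illustrate the cache-locality argument, here is a small Python sketch using ctypes. The two structures are hypothetical stand-ins, not the real struct inode: the choice of fields, the 400-byte other_state filler, and the 64-byte cache line size are assumptions made for illustration, and each structure is assumed to start on a cache line boundary.

```python
import ctypes

CACHE_LINE = 64  # bytes; an assumed cache line size, common on x86-64


class ScatteredInode(ctypes.Structure):
    # Hypothetical layout: two stat-related fields sit after a large block
    # of unrelated state, far away from the other stat fields.
    _fields_ = [
        ("i_mode", ctypes.c_uint16),
        ("i_uid", ctypes.c_uint32),
        ("i_gid", ctypes.c_uint32),
        ("i_size", ctypes.c_int64),
        ("other_state", ctypes.c_uint8 * 400),  # stand-in for unrelated fields
        ("i_blkbits", ctypes.c_uint32),
        ("i_blocks", ctypes.c_uint64),
    ]


class GroupedInode(ctypes.Structure):
    # Same fields, with the stat-related ones moved next to each other.
    _fields_ = [
        ("i_mode", ctypes.c_uint16),
        ("i_uid", ctypes.c_uint32),
        ("i_gid", ctypes.c_uint32),
        ("i_size", ctypes.c_int64),
        ("i_blkbits", ctypes.c_uint32),
        ("i_blocks", ctypes.c_uint64),
        ("other_state", ctypes.c_uint8 * 400),
    ]


STAT_FIELDS = ("i_mode", "i_uid", "i_gid", "i_size", "i_blkbits", "i_blocks")


def cache_lines_touched(struct_type, field_names, line_size=CACHE_LINE):
    """Return the set of cache line indexes covered by the named fields."""
    lines = set()
    for name in field_names:
        field = getattr(struct_type, name)  # ctypes field descriptor
        first = field.offset // line_size
        last = (field.offset + field.size - 1) // line_size
        lines.update(range(first, last + 1))
    return lines


if __name__ == "__main__":
    for layout in (ScatteredInode, GroupedInode):
        touched = cache_lines_touched(layout, STAT_FIELDS)
        print(layout.__name__, "size:", ctypes.sizeof(layout),
              "cache lines touched by stat fields:", sorted(touched))
```

In this model the two layouts have the same total size, but the grouped layout keeps every stat field in one cache line while the scattered layout needs a second, distant one. That mirrors the trade-off described above: the reordering is about locality, not about shrinking the structure.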
      
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. Jun 03, 2012
    • Linux 3.5-rc1 · f8f5701b
      Linus Torvalds authored
    • Merge tag 'dm-3.5-changes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm · 912afc36
      Linus Torvalds authored
      Pull device-mapper updates from Alasdair G Kergon:
       "Improve multipath's retrying mechanism in some defined circumstances
        and provide a simple reserve/release mechanism for userspace tools to
        access thin provisioning metadata while the pool is in use."
      
      * tag 'dm-3.5-changes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
        dm thin: provide userspace access to pool metadata
        dm thin: use slab mempools
        dm mpath: allow ioctls to trigger pg init
        dm mpath: delay retry of bypassed pg
        dm mpath: reduce size of struct multipath
    • dm thin: provide userspace access to pool metadata · cc8394d8
      Joe Thornber authored
      This patch implements two new messages that can be sent to the thin
      pool target, allowing it to take a snapshot of the _metadata_.  This
      read-only snapshot can be accessed by userland concurrently with the
      live target.
      
      Only one metadata snapshot can be held at a time.  The pool's status
      line will give the block location for the current msnap.
      
      Since version 0.1.5 of the userland thin provisioning tools, the
      thin_dump program displays the msnap as follows:
      
          thin_dump -m <msnap root> <metadata dev>
      
      Available here: https://github.com/jthornber/thin-provisioning-tools
      
      Now that userland can access the metadata we can do various things
      that have traditionally been kernel side tasks:
      
           i) Incremental backups.
      
           By using metadata snapshots we can work out what blocks have
           changed over time.  Combined with data snapshots we can ensure
           the data doesn't change while we back it up.
      
           A short proof of concept script can be found here:
      
           https://github.com/jthornber/thinp-test-suite/blob/master/incremental_backup_example.rb
      
           ii) Migration of thin devices from one pool to another.
      
           iii) Merging snapshots back into an external origin.
      
           iv) Asynchronous replication.
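
The incremental backup idea in item i) can be sketched as a comparison of two block mappings. This is a minimal Python illustration, not the proof of concept script linked above: the dictionaries stand in for mappings that a tool would extract from two metadata snapshots, the function names are hypothetical, and it assumes that a block rewritten after a data snapshot shows up under a different data block in the newer mapping.

```python
def changed_blocks(old_map, new_map):
    """Virtual blocks that are new, or that point at a different data block.

    Both arguments map a virtual block number to a data block number, as
    they might be read from an older and a newer metadata snapshot.
    """
    return sorted(
        vblock
        for vblock, dblock in new_map.items()
        if old_map.get(vblock) != dblock
    )


def discarded_blocks(old_map, new_map):
    """Virtual blocks that were mapped before but are no longer mapped."""
    return sorted(vblock for vblock in old_map if vblock not in new_map)


if __name__ == "__main__":
    # Illustrative mappings only; real metadata carries more information.
    older = {0: 100, 1: 101, 2: 102}
    newer = {0: 100, 1: 205, 3: 206}
    print("copy these blocks:", changed_blocks(older, newer))
    print("drop these blocks:", discarded_blocks(older, newer))
```

A backup tool built on this idea would copy only the blocks reported as changed, reading them from a data snapshot so the contents stay stable during the copy.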
      
      Signed-off-by: Joe Thornber <ejt@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm thin: use slab mempools · a24c2569
      Mike Snitzer authored
      
      Use dedicated caches prefixed with a "dm_" name rather than relying on
      kmalloc mempools backed by generic slab caches so the memory usage of
      thin provisioning (and any leaks) can be accounted for independently.
      
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>
    • dm mpath: allow ioctls to trigger pg init · 35991652
      Mikulas Patocka authored
      
      After the failure of a group of paths, any alternative paths that
      need initialising do not become available until further I/O is sent to
      the device.  Until this has happened, ioctls return -EAGAIN.
      
      With this patch, new paths are made available in response to an ioctl
      too.  The processing of the ioctl gets delayed until this has happened.
      
      Instead of returning an error, we submit a work item to kmultipathd
      (that will potentially activate the new path) and retry in ten
      milliseconds.
      
      Note that the patch doesn't retry an ioctl if the ioctl itself fails due
      to a path failure.  Such retries should be handled intelligently by the
      code that generated the ioctl in the first place, noting that some SCSI
      commands should not be retried because they are not idempotent (XOR write
      commands).  For commands that could be retried, there is a danger that
      if the device rejected the SCSI command, the path could be erroneously
      marked as failed, and the request would be retried on another path which
      might fail too.  It can be determined if the failure happens on the
      device or on the SCSI controller, but there is no guarantee that all
      SCSI drivers set these flags correctly.
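
The retry behaviour described above can be modelled in userspace as follows. This is a sketch in Python, not the kernel code: try_ioctl and queue_pg_init are hypothetical callbacks standing in for the ioctl attempt and the work item submitted to kmultipathd, and max_attempts is an assumption added so the model always terminates (the commit message does not state a bound).

```python
import errno
import time

RETRY_DELAY_SECONDS = 0.010  # the ten millisecond delay mentioned above


def ioctl_with_pg_init(try_ioctl, queue_pg_init,
                       max_attempts=100, sleep=time.sleep):
    """Retry an ioctl while no usable path is available yet.

    try_ioctl() returns 0 on success or a negative errno value.
    queue_pg_init() requests path group initialisation.
    Only -EAGAIN triggers a retry; any other result, including an error
    caused by a path failure, is returned to the caller unchanged.
    """
    for _ in range(max_attempts):
        result = try_ioctl()
        if result != -errno.EAGAIN:
            return result
        queue_pg_init()             # may activate a new path
        sleep(RETRY_DELAY_SECONDS)  # wait before trying again
    return -errno.EAGAIN            # assumed fallback once the bound is hit


if __name__ == "__main__":
    state = {"initialised": False}

    def fake_ioctl():
        return 0 if state["initialised"] else -errno.EAGAIN

    def fake_pg_init():
        state["initialised"] = True

    print("result:", ioctl_with_pg_init(fake_ioctl, fake_pg_init))
```

Returning non-EAGAIN errors immediately reflects the point made above: deciding whether a failed command is safe to retry is left to the code that issued the ioctl.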
      
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Alasdair G Kergon <agk@redhat.com>