  1. Sep 15, 2018
    • block, bfq: do not plug I/O if all queues are weight-raised · c8765de0
      Paolo Valente authored
      
      
      To reduce latency for interactive and soft real-time applications,
      bfq privileges the bfq_queues containing the I/O of these
      applications. These privileged queues, referred to as weight-raised
      queues, get a much higher share of the device throughput than
      non-privileged queues. To preserve this higher share, the I/O of
      any non-weight-raised queue must be plugged whenever a sync
      weight-raised queue, while being served, remains temporarily empty.
      To attain this goal, bfq simply plugs any I/O (from any queue) if a
      sync weight-raised queue remains empty while in service.
      
      Unfortunately, this plugging typically lowers throughput with random
      I/O, on devices with internal queueing (because it reduces the filling
      level of the internal queues of the device).
      
      This commit addresses this issue by restricting the cases where
      plugging is performed: if a sync weight-raised queue remains empty
      while in service, then I/O plugging is performed only if some of
      the active bfq_queues are *not* weight-raised (which is actually
      the only circumstance where plugging is needed to preserve the
      higher throughput share of weight-raised queues). This restriction
      proved to boost throughput in a great many use cases that need
      only maximum throughput.
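
      A hedged sketch of the restricted plugging condition described
      above; the names are illustrative stand-ins, not the actual BFQ
      symbols:

          static bool should_plug_io(unsigned int busy_queues,
                                     unsigned int wr_busy_queues,
                                     bool in_service_sync_wr_empty)
          {
                  if (!in_service_sync_wr_empty)
                          return false;
                  /*
                   * Plug only if at least one active queue is *not*
                   * weight-raised: the only case where plugging is
                   * needed to preserve the raised queues' share.
                   */
                  return wr_busy_queues < busy_queues;
          }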
      
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: inject other-queue I/O into seeky idle queues on NCQ flash · d0edc247
      Paolo Valente authored
      
      
      The Achilles' heel of BFQ is its failure to reach high throughput
      with sync random I/O on flash storage with internal queueing, when
      the processes doing I/O have differentiated weights.
      
      The cause of this failure is as follows. If at least two processes do
      sync I/O, and have a different weight from each other, then BFQ plugs
      I/O dispatching every time one of these processes, while it is being
      served, remains temporarily without pending I/O requests. This
      plugging is necessary to guarantee that every process enjoys a
      bandwidth proportional to its weight; but it empties the internal
      queue(s) of the drive. And this kills throughput with random I/O. So,
      if some processes have differentiated weights and do both sync and
      random I/O, the end result is a throughput collapse.
      
      This commit tries to counter this problem by injecting the service
      of other processes, in a controlled way, while the process in
      service happens to have no I/O. This injection is performed only
      if the medium is non-rotational and performs internal queueing,
      and the process in service does random I/O (service injection
      might be beneficial for sequential I/O too; we'll work on that).
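
      A minimal sketch of the injection condition just described, with
      illustrative names rather than the actual BFQ symbols:

          static bool may_inject_other_io(bool nonrot_with_queueing,
                                          bool in_service_random_io,
                                          bool in_service_has_reqs)
          {
                  /*
                   * Inject other queues' I/O only on non-rotational
                   * media with internal queueing, while the in-service
                   * queue does random I/O and is momentarily out of
                   * pending requests.
                   */
                  return nonrot_with_queueing &&
                         in_service_random_io &&
                         !in_service_has_reqs;
          }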
      
      As an example of the benefits of this commit, on a PLEXTOR
      PX-256M5S SSD, with five processes having differentiated weights
      and doing sync random 4KB I/O, this commit makes the throughput
      with bfq grow four-fold, from 25 to 100 MB/s. This higher
      throughput is 10 MB/s lower than that reached with the none I/O
      scheduler. As some less-random I/O is added to the mix, the
      throughput becomes equal to or higher than that with none.
      
      This commit is a first attempt to recover throughput without
      losing control, and certainly has many limitations. One is, e.g.,
      that the processes whose service is injected are not chosen so as
      to distribute the extra bandwidth they receive in accordance with
      their weights. Thus there might be a loss of weighted fairness in
      some cases. Anyway, this loss concerns extra service, which would
      not have been received at all without this commit. Other
      limitations and issues will probably show up with usage.
      
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: correctly charge and reset entity service in all cases · cbeb869a
      Paolo Valente authored
      
      
      BFQ schedules entities (which represent either per-process queues
      or groups of queues) as a function of their timestamps, in
      particular of their (virtual) finish times. The finish time of an
      entity is computed as a function of the budget assigned to the
      entity, assuming, tentatively, that the entity, once in service,
      will receive an amount of service equal to its budget. Then, when
      the entity is expired because it has finished being served, this
      finish time is updated as a function of the actual service
      received by the entity. This allows the entity to be correctly
      charged with only the service received, and then to be correctly
      re-scheduled.
      
      Yet an entity may receive service also while not being the entity in
      service (in the scheduling environment of its parent entity), for
      several reasons. If the entity remains with no backlog while receiving
      this 'unofficial' service, then it is expired. Also on such an
      expiration, the finish time of the entity should be updated to account
      for only the service actually received by the entity. Unfortunately,
      such an update is not performed for an entity expiring without being
      the entity in service.
      
      In a similar vein, the service counter of the entity in service is
      reset when the entity is expired, so that it is ready for the next
      service cycle. This reset should also be performed when an entity
      is expired because it remains empty after receiving service while
      not being the entity in service. But in this case the reset is not
      performed.
      
      This commit performs the above finish-time update and
      service-counter reset also for an entity that expires while not
      being the entity in service.
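
      A hedged sketch of the behaviour after the fix; the type and
      field names are illustrative stand-ins, not the actual BFQ entity
      fields:

          struct entity_sketch {
                  unsigned long start;    /* virtual start time */
                  unsigned long finish;   /* virtual finish time */
                  unsigned long service;  /* service got this cycle */
                  unsigned long weight;   /* weight > 0 */
          };

          /* Runs on *every* expiration after the fix, including for
           * an entity that was not the one in service. */
          static void on_expiration(struct entity_sketch *e)
          {
                  /* Charge only the service actually received. */
                  e->finish = e->start + e->service / e->weight;
                  /* Reset the counter for the next service cycle. */
                  e->service = 0;
          }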
      
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. Sep 07, 2018
    • block: remove bio_rewind_iter() · 7759eb23
      Ming Lei authored
      It has been pointed out that bio_rewind_iter() is a very bad
      API [1]:
      
      1) bio size may not be restored after rewinding
      
      2) it causes some bogus change, such as 5151842b ("block: reset
         bi_iter.bi_done after splitting bio")
      
      3) rewinding really makes things complicated wrt. bio splitting
      
      4) unnecessary updating of .bi_done in the fast path
      
      [1] https://marc.info/?t=153549924200005&r=1&w=2
      
      So this patch takes Kent's suggestion to restore a bio to its
      original state by saving its iterator (struct bvec_iter) in
      bio_integrity_prep(), given that bio_rewind_iter() is now only
      used by the bio integrity code.
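
      A hedged sketch of the save/restore idea; the structures are
      pared-down stand-ins (the real kernel types carry more fields):

          struct bvec_iter_sketch {
                  unsigned long bi_sector;     /* device sector */
                  unsigned int  bi_size;       /* residual I/O size */
                  unsigned int  bi_idx;        /* current bvec index */
                  unsigned int  bi_bvec_done;  /* bytes done in bvec */
          };

          /* In bio_integrity_prep(): snapshot by plain struct copy. */
          static void save_iter(struct bvec_iter_sketch *saved,
                                const struct bvec_iter_sketch *cur)
          {
                  *saved = *cur;  /* full state; nothing to rewind */
          }

          /* At verification time: restore by assignment instead of
           * rewinding the iterator backwards. */
          static void restore_iter(struct bvec_iter_sketch *cur,
                                   const struct bvec_iter_sketch *saved)
          {
                  *cur = *saved;
          }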
      
      Cc: Dmitry Monakhov <dmonakhov@openvz.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Suggested-by: Kent Overstreet <kent.overstreet@gmail.com>
      Acked-by: Kent Overstreet <kent.overstreet@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • drbd: Convert from ahash to shash · 3d0e6375
      Kees Cook authored
      
      
      In preparing to remove all stack VLA usage from the kernel[1], this
      removes the discouraged use of AHASH_REQUEST_ON_STACK in favor of
      the smaller SHASH_DESC_ON_STACK by converting from ahash-wrapped-shash
      to direct shash. By removing a layer of indirection this both improves
      performance and reduces stack usage. The stack allocation will be made
      a fixed size in a later patch to the crypto subsystem.
      
      The bulk of the lines in this change are a simple s/ahash/shash/,
      but the main logic differences are in drbd_csum_ee() and
      drbd_csum_bio(), which externalize the page walking with
      k(un)map_atomic() instead of using scatter-gather.
      
      [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
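
      A hedged sketch of the ahash-to-shash pattern: map each page and
      feed it to crypto_shash_update() directly, instead of building a
      scatterlist for an ahash request (simplified, error handling
      omitted; the helper name is made up):

          #include <crypto/hash.h>
          #include <linux/highmem.h>
          #include <linux/kernel.h>

          static void csum_pages_sketch(struct crypto_shash *tfm,
                                        struct page **pages,
                                        unsigned int nr_pages,
                                        unsigned int len, u8 *digest)
          {
                  SHASH_DESC_ON_STACK(desc, tfm);  /* fixed-size stack */
                  unsigned int i;

                  desc->tfm = tfm;
                  crypto_shash_init(desc);

                  for (i = 0; i < nr_pages && len; i++) {
                          unsigned int n = min_t(unsigned int, len,
                                                 PAGE_SIZE);
                          void *src = kmap_atomic(pages[i]);

                          crypto_shash_update(desc, src, n);
                          kunmap_atomic(src);
                          len -= n;
                  }

                  crypto_shash_final(desc, digest);
                  shash_desc_zero(desc);  /* wipe the descriptor */
          }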
      
      Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge tag 'for-linus-20180906' of git://git.kernel.dk/linux-block · ca16eb34
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "Small collection of fixes that should go into this release. This
        contains:
      
         - Small series that fixes a race between blkcg teardown and writeback
           (Dennis Zhou)
      
         - Fix disallowing invalid block size settings from the nbd ioctl (me)
      
         - BFQ fix for a use-after-free on last release of a bfqg (Konstantin
           Khlebnikov)
      
         - Fix for the "don't warn for flush" fix (Mikulas)"
      
      * tag 'for-linus-20180906' of git://git.kernel.dk/linux-block:
        block: bfq: swap puts in bfqg_and_blkg_put
        block: don't warn when doing fsync on read-only devices
        nbd: don't allow invalid blocksize settings
        blkcg: use tryget logic when associating a blkg with a bio
        blkcg: delay blkg destruction until after writeback has finished
        Revert "blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()"
    • block: bfq: swap puts in bfqg_and_blkg_put · d5274b3c
      Konstantin Khlebnikov authored
      Fix a trivial use-after-free: this could be the last reference to
      the bfqg.
      
      Fixes: 8f9bebc3 ("block, bfq: access and cache blkg data only
      when safe")
      Acked-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge tag 'apparmor-pr-2018-09-06' of... · db44bf4b
      Linus Torvalds authored
      Merge tag 'apparmor-pr-2018-09-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor
      
      Pull apparmor fix from John Johansen:
       "A fix for an issue syzbot discovered last week:
      
         - Fix for bad debug check when converting secids to secctx"
      
      * tag 'apparmor-pr-2018-09-06' of git://git.kernel.org/pub/scm/linux/kernel/git/jj/linux-apparmor:
        apparmor: fix bad debug check in apparmor_secid_to_secctx()
    • Merge tag 'trace-v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · be65e259
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
       "This fixes two annoying bugs:
      
         - The first one is a side effect caused by using SRCU for rcuidle
           tracepoints. It seems that perf was depending on the rcuidle
           tracepoints to make RCU watch when it wasn't.
      
           The real fix will be to have perf use SRCU instead of depending on
           RCU watching, but that can't be done until SRCU is safe to use in
           NMI context (Paul's working on that).
      
         - The second bug fix is for a bug that's been periodically making my
           tests fail randomly for some time. I haven't had time to track it
           down, but finally have. It has to do with stressing NMIs (via perf)
           while enabling or disabling ftrace function handling with lockdep
           enabled.
      
           If an interrupt happens, then just as it returns it sets
           lockdep back to "interrupts enabled", but before it actually
           returns an NMI is triggered. If this happens while
           printk_nmi_enter() has a breakpoint attached to it (because
           ftrace is converting it to or from nop to call fentry), the
           breakpoint trap also calls into lockdep. Since the NMI
           returned to an interrupt handler, and interrupts were
           disabled when the NMI went off, lockdep keeps its state as
           "interrupts disabled" when control returns from that
           interrupt handler, where interrupts are in fact enabled.
      
           This causes lockdep_assert_irqs_enabled() to trigger a false
           positive"
      
      * tag 'trace-v4.19-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        printk/tracing: Do not trace printk_nmi_enter()
        tracing: Add back in rcu_irq_enter/exit_irqson() for rcuidle tracepoints