Commit df668a5f authored by Linus Torvalds

Merge tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block

Pull core block updates from Jens Axboe:

 - disk events cleanup (Christoph)

 - gendisk and request queue allocation simplifications (Christoph)

 - bdev_disk_changed cleanups (Christoph)

 - IO priority improvements (Bart)

 - Chained bio completion trace fix (Edward)

 - blk-wbt fixes (Jan)

 - blk-wbt enable/disable fix (Zhang)

 - Scheduler dispatch improvements (Jan, Ming)

 - Shared tagset scheduler improvements (John)

 - BFQ updates (Paolo, Luca, Pietro)

 - BFQ lock inversion fix (Jan)

 - Documentation improvements (Kir)

 - CLONE_IO block cgroup fix (Tejun)

 - Removal of the ancient and deprecated block dump feature (zhangyi)

 - Discard merge fix (Ming)

 - Misc fixes or followup fixes (Colin, Damien, Dan, Long, Max, Thomas,
   Yang)

* tag 'for-5.14/block-2021-06-29' of git://git.kernel.dk/linux-block: (129 commits)
  block: fix discard request merge
  block/mq-deadline: Remove a WARN_ON_ONCE() call
  blk-mq: update hctx->dispatch_busy in case of real scheduler
  blk: Fix lock inversion between ioc lock and bfqd lock
  bfq: Remove merged request already in bfq_requests_merged()
  block: pass a gendisk to bdev_disk_changed
  block: move bdev_disk_changed
  block: add the events* attributes to disk_attrs
  block: move the disk events code to a separate file
  block: fix trace completion for chained bio
  block/partitions/msdos: Fix typo inidicator -> indicator
  block, bfq: reset waker pointer with shared queues
  block, bfq: check waker only for queues with no in-flight I/O
  block, bfq: avoid delayed merge of async queues
  block, bfq: boost throughput by extending queue-merging times
  block, bfq: consider also creation time in delayed stable merge
  block, bfq: fix delayed stable merge check
  block, bfq: let also stably merged queues enjoy weight raising
  blk-wbt: make sure throttle is enabled properly
  blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled()
  ...
parents df04fbe8 2705dfb2
+80 −75
@@ -17,29 +17,30 @@ level logical devices like device mapper.

HOWTO
=====

Throttling/Upper Limit policy
-----------------------------
- Enable Block IO controller::
Enable Block IO controller::

	CONFIG_BLK_CGROUP=y

- Enable throttling in block layer::
Enable throttling in block layer::

	CONFIG_BLK_DEV_THROTTLING=y

- Mount blkio controller (see cgroups.txt, Why are cgroups needed?)::
Mount blkio controller (see cgroups.txt, Why are cgroups needed?)::

        mount -t cgroup -o blkio none /sys/fs/cgroup/blkio

- Specify a bandwidth rate on particular device for root group. The format
Specify a bandwidth rate on particular device for root group. The format
for policy is "<major>:<minor>  <bytes_per_second>"::

        echo "8:16  1048576" > /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device

  Above will put a limit of 1MB/second on reads happening for root group
This will put a limit of 1MB/second on reads happening for root group
on device having major/minor number 8:16.
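
As a minimal sketch of the policy format above, the following Python helper builds the ``<major>:<minor>  <bytes_per_second>`` string. The function name ``throttle_rule`` is hypothetical and is not part of any kernel or library interface; the resulting string is what would be written to ``blkio.throttle.read_bps_device``, as the ``echo`` command above does.

```python
def throttle_rule(major: int, minor: int, bytes_per_second: int) -> str:
    """Build a '<major>:<minor>  <bytes_per_second>' policy string.

    Hypothetical helper: it only formats the rule, it does not write it.
    """
    if major < 0 or minor < 0:
        raise ValueError("device numbers must be non-negative")
    if bytes_per_second < 0:
        raise ValueError("rate must be non-negative")
    # Two spaces between the fields, matching the format shown in the text.
    return f"{major}:{minor}  {bytes_per_second}"


# 1 MB/second read limit for the device with major/minor number 8:16.
rule = throttle_rule(8, 16, 1024 * 1024)
print(rule)
```

Keeping the formatting separate from the write makes the rule easy to check before it is applied to a cgroup file, which normally requires root privileges.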

- Run dd to read a file and see if rate is throttled to 1MB/s or not::
Run dd to read a file and see if rate is throttled to 1MB/s or not::

        # dd iflag=direct if=/mnt/common/zerofile of=/dev/null bs=4K count=1024
        1024+0 records in
@@ -79,85 +80,89 @@ following::

Various user visible config options
===================================

  CONFIG_BLK_CGROUP
	- Block IO controller.
	  Block IO controller.

  CONFIG_BFQ_CGROUP_DEBUG
	- Debug help. Right now some additional stats file show up in cgroup
	  Debug help. Right now some additional stats file show up in cgroup
	  if this option is enabled.

  CONFIG_BLK_DEV_THROTTLING
	- Enable block device throttling support in block layer.
	  Enable block device throttling support in block layer.

Details of cgroup files
=======================

Proportional weight policy files
--------------------------------
- blkio.weight
	- Specifies per cgroup weight. This is default weight of the group
	  on all the devices until and unless overridden by per device rule.
	  (See blkio.weight_device).
	  Currently allowed range of weights is from 10 to 1000.

- blkio.weight_device
	- One can specify per cgroup per device rules using this interface.
	  These rules override the default value of group weight as specified
	  by blkio.weight.
  blkio.bfq.weight
	  Specifies per cgroup weight. This is default weight of the group
	  on all the devices until and unless overridden by per device rule
	  (see `blkio.bfq.weight_device` below).

	  Currently allowed range of weights is from 1 to 1000. For more details,
          see Documentation/block/bfq-iosched.rst.

  blkio.bfq.weight_device
          Specifies per cgroup per device weights, overriding the default group
          weight. For more details, see Documentation/block/bfq-iosched.rst.

	  Following is the format::

	    # echo dev_maj:dev_minor weight > blkio.weight_device
	    # echo dev_maj:dev_minor weight > blkio.bfq.weight_device

	  Configure weight=300 on /dev/sdb (8:16) in this cgroup::

	    # echo 8:16 300 > blkio.weight_device
	    # cat blkio.weight_device
	    # echo 8:16 300 > blkio.bfq.weight_device
	    # cat blkio.bfq.weight_device
	    dev     weight
	    8:16    300

	  Configure weight=500 on /dev/sda (8:0) in this cgroup::

	    # echo 8:0 500 > blkio.weight_device
	    # cat blkio.weight_device
	    # echo 8:0 500 > blkio.bfq.weight_device
	    # cat blkio.bfq.weight_device
	    dev     weight
	    8:0     500
	    8:16    300

	  Remove specific weight for /dev/sda in this cgroup::

	    # echo 8:0 0 > blkio.weight_device
	    # cat blkio.weight_device
	    # echo 8:0 0 > blkio.bfq.weight_device
	    # cat blkio.bfq.weight_device
	    dev     weight
	    8:16    300

- blkio.time
	- disk time allocated to cgroup per device in milliseconds. First
  blkio.time
	  Disk time allocated to cgroup per device in milliseconds. First
	  two fields specify the major and minor number of the device and
	  third field specifies the disk time allocated to group in
	  milliseconds.

- blkio.sectors
	- number of sectors transferred to/from disk by the group. First
  blkio.sectors
	  Number of sectors transferred to/from disk by the group. First
	  two fields specify the major and minor number of the device and
	  third field specifies the number of sectors transferred by the
	  group to/from the device.

- blkio.io_service_bytes
	- Number of bytes transferred to/from the disk by the group. These
  blkio.io_service_bytes
	  Number of bytes transferred to/from the disk by the group. These
	  are further divided by the type of operation - read or write, sync
	  or async. First two fields specify the major and minor number of the
	  device, third field specifies the operation type and the fourth field
	  specifies the number of bytes.

- blkio.io_serviced
	- Number of IOs (bio) issued to the disk by the group. These
  blkio.io_serviced
	  Number of IOs (bio) issued to the disk by the group. These
	  are further divided by the type of operation - read or write, sync
	  or async. First two fields specify the major and minor number of the
	  device, third field specifies the operation type and the fourth field
	  specifies the number of IOs.

- blkio.io_service_time
	- Total amount of time between request dispatch and request completion
  blkio.io_service_time
	  Total amount of time between request dispatch and request completion
	  for the IOs done by this cgroup. This is in nanoseconds to make it
	  meaningful for flash devices too. For devices with queue depth of 1,
	  this time represents the actual service time. When queue_depth > 1,
@@ -170,8 +175,8 @@ Proportional weight policy files
	  specifies the operation type and the fourth field specifies the
	  io_service_time in ns.

- blkio.io_wait_time
	- Total amount of time the IOs for this cgroup spent waiting in the
  blkio.io_wait_time
	  Total amount of time the IOs for this cgroup spent waiting in the
	  scheduler queues for service. This can be greater than the total time
	  elapsed since it is cumulative io_wait_time for all IOs. It is not a
	  measure of total time the cgroup spent waiting but rather a measure of
@@ -185,24 +190,24 @@ Proportional weight policy files
	  minor number of the device, third field specifies the operation type
	  and the fourth field specifies the io_wait_time in ns.

- blkio.io_merged
	- Total number of bios/requests merged into requests belonging to this
  blkio.io_merged
	  Total number of bios/requests merged into requests belonging to this
	  cgroup. This is further divided by the type of operation - read or
	  write, sync or async.

- blkio.io_queued
	- Total number of requests queued up at any given instant for this
  blkio.io_queued
	  Total number of requests queued up at any given instant for this
	  cgroup. This is further divided by the type of operation - read or
	  write, sync or async.

- blkio.avg_queue_size
	- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
  blkio.avg_queue_size
	  Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
	  The average queue size for this cgroup over the entire time of this
	  cgroup's existence. Queue size samples are taken each time one of the
	  queues of this cgroup gets a timeslice.

- blkio.group_wait_time
	- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
  blkio.group_wait_time
	  Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
	  This is the amount of time the cgroup had to wait since it became busy
	  (i.e., went from 0 to 1 request queued) to get a timeslice for one of
	  its queues. This is different from the io_wait_time which is the
@@ -212,8 +217,8 @@ Proportional weight policy files
	  will only report the group_wait_time accumulated till the last time it
	  got a timeslice and will not include the current delta.

- blkio.empty_time
	- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
  blkio.empty_time
	  Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
	  This is the amount of time a cgroup spends without any pending
	  requests when not being served, i.e., it does not include any time
	  spent idling for one of the queues of the cgroup. This is in
@@ -221,8 +226,8 @@ Proportional weight policy files
	  the stat will only report the empty_time accumulated till the last
	  time it had a pending request and will not include the current delta.

- blkio.idle_time
	- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
  blkio.idle_time
	  Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y.
	  This is the amount of time spent by the IO scheduler idling for a
	  given cgroup in anticipation of a better request than the existing ones
	  from other queues/cgroups. This is in nanoseconds. If this is read
@@ -230,43 +235,43 @@ Proportional weight policy files
	  idle_time accumulated till the last idle period and will not include
	  the current delta.

- blkio.dequeue
	- Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. This
  blkio.dequeue
	  Debugging aid only enabled if CONFIG_BFQ_CGROUP_DEBUG=y. This
	  gives the statistics about how many times a group was dequeued
	  from service tree of the device. First two fields specify the major
	  and minor number of the device and third field specifies the number
	  of times a group was dequeued from a particular device.

- blkio.*_recursive
	- Recursive version of various stats. These files show the
  blkio.*_recursive
	  Recursive version of various stats. These files show the
          same information as their non-recursive counterparts but
          include stats from all the descendant cgroups.

Throttling/Upper limit policy files
-----------------------------------
- blkio.throttle.read_bps_device
	- Specifies upper limit on READ rate from the device. IO rate is
  blkio.throttle.read_bps_device
	  Specifies upper limit on READ rate from the device. IO rate is
	  specified in bytes per second. Rules are per device. Following is
	  the format::

	    echo "<major>:<minor>  <rate_bytes_per_second>" > /cgrp/blkio.throttle.read_bps_device

- blkio.throttle.write_bps_device
	- Specifies upper limit on WRITE rate to the device. IO rate is
  blkio.throttle.write_bps_device
	  Specifies upper limit on WRITE rate to the device. IO rate is
	  specified in bytes per second. Rules are per device. Following is
	  the format::

	    echo "<major>:<minor>  <rate_bytes_per_second>" > /cgrp/blkio.throttle.write_bps_device

- blkio.throttle.read_iops_device
	- Specifies upper limit on READ rate from the device. IO rate is
  blkio.throttle.read_iops_device
	  Specifies upper limit on READ rate from the device. IO rate is
	  specified in IO per second. Rules are per device. Following is
	  the format::

	   echo "<major>:<minor>  <rate_io_per_second>" > /cgrp/blkio.throttle.read_iops_device

- blkio.throttle.write_iops_device
	- Specifies upper limit on WRITE rate to the device. IO rate is
  blkio.throttle.write_iops_device
	  Specifies upper limit on WRITE rate to the device. IO rate is
	  specified in io per second. Rules are per device. Following is
	  the format::

@@ -275,15 +280,15 @@ Throttling/Upper limit policy files
          Note: If both BW and IOPS rules are specified for a device, then IO is
          subjected to both the constraints.

- blkio.throttle.io_serviced
	- Number of IOs (bio) issued to the disk by the group. These
  blkio.throttle.io_serviced
	  Number of IOs (bio) issued to the disk by the group. These
	  are further divided by the type of operation - read or write, sync
	  or async. First two fields specify the major and minor number of the
	  device, third field specifies the operation type and the fourth field
	  specifies the number of IOs.

- blkio.throttle.io_service_bytes
	- Number of bytes transferred to/from the disk by the group. These
  blkio.throttle.io_service_bytes
	  Number of bytes transferred to/from the disk by the group. These
	  are further divided by the type of operation - read or write, sync
	  or async. First two fields specify the major and minor number of the
	  device, third field specifies the operation type and the fourth field
@@ -291,6 +296,6 @@ Note: If both BW and IOPS rules are specified for a device, then IO is

Common files among various policies
-----------------------------------
- blkio.reset_stats
	- Writing an int to this file will result in resetting all the stats
  blkio.reset_stats
	  Writing an int to this file will result in resetting all the stats
	  for that cgroup.
+55 −0
@@ -56,6 +56,7 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou
       5-3-3. IO Latency
         5-3-3-1. How IO Latency Throttling Works
         5-3-3-2. IO Latency Interface Files
       5-3-4. IO Priority
     5-4. PID
       5-4-1. PID Interface Files
     5-5. Cpuset
@@ -1866,6 +1867,60 @@ IO Latency Interface Files
		duration of time between evaluation events.  Windows only elapse
		with IO activity.  Idle periods extend the most recent window.

IO Priority
~~~~~~~~~~~

A single attribute controls the behavior of the I/O priority cgroup policy,
namely the blkio.prio.class attribute. The following values are accepted for
that attribute:

  no-change
	Do not modify the I/O priority class.

  none-to-rt
	For requests that do not have an I/O priority class (NONE),
	change the I/O priority class into RT. Do not modify
	the I/O priority class of other requests.

  restrict-to-be
	For requests that do not have an I/O priority class or that have I/O
	priority class RT, change it into BE. Do not modify the I/O priority
	class of requests that have priority class IDLE.

  idle
	Change the I/O priority class of all requests into IDLE, the lowest
	I/O priority class.

The following numerical values are associated with the I/O priority policies:

+----------------+---+
| no-change      | 0 |
+----------------+---+
| none-to-rt     | 1 |
+----------------+---+
| restrict-to-be | 2 |
+----------------+---+
| idle           | 3 |
+----------------+---+

The numerical value that corresponds to each I/O priority class is as follows:

+-------------------------------+---+
| IOPRIO_CLASS_NONE             | 0 |
+-------------------------------+---+
| IOPRIO_CLASS_RT (real-time)   | 1 |
+-------------------------------+---+
| IOPRIO_CLASS_BE (best effort) | 2 |
+-------------------------------+---+
| IOPRIO_CLASS_IDLE             | 3 |
+-------------------------------+---+

The algorithm to set the I/O priority class for a request is as follows:

- Translate the I/O priority class policy into a number.
- Change the request I/O priority class into the maximum of the I/O priority
  class policy number and the numerical I/O priority class.
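
The two steps above can be sketched in Python as follows. This is an illustration of the described algorithm only, not the kernel implementation; the names ``POLICY_NUMBER`` and ``effective_class`` are hypothetical, and the policy names are the accepted attribute values listed earlier in this section.

```python
# Numerical values of the I/O priority classes, as listed in the text.
IOPRIO_CLASS_NONE = 0
IOPRIO_CLASS_RT = 1
IOPRIO_CLASS_BE = 2
IOPRIO_CLASS_IDLE = 3

# Step 1: translate the I/O priority class policy into a number.
# Keys are the accepted attribute values (hypothetical table name).
POLICY_NUMBER = {
    "no-change": 0,
    "none-to-rt": 1,
    "restrict-to-be": 2,
    "idle": 3,
}


def effective_class(policy: str, request_class: int) -> int:
    """Step 2: take the maximum of the policy number and the request class."""
    return max(POLICY_NUMBER[policy], request_class)


# A request without a class is promoted to RT under none-to-rt.
print(effective_class("none-to-rt", IOPRIO_CLASS_NONE))
# An RT request is demoted to BE under restrict-to-be.
print(effective_class("restrict-to-be", IOPRIO_CLASS_RT))
# An IDLE request is left alone under restrict-to-be.
print(effective_class("restrict-to-be", IOPRIO_CLASS_IDLE))
```

Because a larger class number means a lower priority, taking the maximum can only keep or lower the priority of a request that already has a class; only requests with class NONE move to the class selected by the policy.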

PID
---

+0 −11
@@ -101,17 +101,6 @@ this results in concentration of disk activity in a small time interval which
occurs only once every 10 minutes, or whenever the disk is forced to spin up by
a cache miss. The disk can then be spun down in the periods of inactivity.

If you want to find out which process caused the disk to spin up, you can
gather information by setting the flag /proc/sys/vm/block_dump. When this flag
is set, Linux reports all disk read and write operations that take place, and
all block dirtyings done to files. This makes it possible to debug why a disk
needs to spin up, and to increase battery life even more. The output of
block_dump is written to the kernel output, and it can be retrieved using
"dmesg". When you use block_dump and your kernel logging level also includes
kernel debugging messages, you probably want to turn off klogd, otherwise
the output of block_dump will be logged, causing disk activity that is not
normally there.


Configuration
-------------
+0 −8
@@ -25,7 +25,6 @@ files can be found in mm/swap.c.
Currently, these files are in /proc/sys/vm:

- admin_reserve_kbytes
- block_dump
- compact_memory
- compaction_proactiveness
- compact_unevictable_allowed
@@ -106,13 +105,6 @@ On x86_64 this is about 128MB.
Changing this takes effect whenever an application requests memory.


block_dump
==========

block_dump enables block I/O debugging when set to a nonzero value. More
information on block I/O debugging is in Documentation/admin-guide/laptops/laptop-mode.rst.


compact_memory
==============

+27 −11
@@ -553,14 +553,21 @@ throughput sustainable with bfq, because updating the blkio.bfq.*
stats is rather costly, especially for some of the stats enabled by
CONFIG_BFQ_CGROUP_DEBUG.

Parameters to set
-----------------
Parameters
----------

For each group, the following parameters can be set:

  weight
        This specifies the default weight for the cgroup inside its parent.
        Available values: 1..1000 (default: 100).

For each group, there is only the following parameter to set.
        For cgroup v1, it is set by writing the value to `blkio.bfq.weight`.

weight (namely blkio.bfq.weight or io.bfq-weight): the weight of the
group inside its parent. Available values: 1..1000 (default 100). The
linear mapping between ioprio and weights, described at the beginning
        For cgroup v2, it is set by writing the value to `io.bfq.weight`
        (with an optional prefix of `default` and a space).

        The linear mapping between ioprio and weights, described at the beginning
        of the tunable section, is still valid, but all weights higher than
        IOPRIO_BE_NR*10 are mapped to ioprio 0.

@@ -568,6 +575,15 @@ Recall that, if low-latency is set, then BFQ automatically raises the
        weight of the queues associated with interactive and soft real-time
        applications. Unset this tunable if you need/want to control weights.

  weight_device
        This specifies a per-device weight for the cgroup. The syntax is
        `major:minor weight`. A weight of `0` may be used to reset to the default
        weight.

        For cgroup v1, it is set by writing the value to `blkio.bfq.weight_device`.

        For cgroup v2, the file name is `io.bfq.weight`.
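
As a rough sketch of the per-device syntax, the Python helper below formats a device number pair and a weight into one rule string and checks the weight against the 1..1000 range given for ``weight``, with 0 accepted as the reset value. The name ``bfq_weight_device_rule`` is hypothetical, and the device number order follows the ``dev_maj:dev_minor`` examples in the cgroup v1 documentation.

```python
def bfq_weight_device_rule(major: int, minor: int, weight: int) -> str:
    """Format a per-device weight rule (hypothetical helper).

    A weight of 0 resets the device to the default group weight;
    any other weight must lie in the documented 1..1000 range.
    """
    if weight != 0 and not 1 <= weight <= 1000:
        raise ValueError("weight must be 0 (reset) or within 1..1000")
    return f"{major}:{minor} {weight}"


# weight=300 for device 8:16, then a reset rule for device 8:0.
print(bfq_weight_device_rule(8, 16, 300))
print(bfq_weight_device_rule(8, 0, 0))
```

The returned string is what would be written to `blkio.bfq.weight_device` under cgroup v1.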


[1]
    P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O