Skip to content
  1. Aug 05, 2022
    • Nathan Huckleberry's avatar
      dm verity: Add optional "try_verify_in_tasklet" feature · 5721d4e5
      Nathan Huckleberry authored
      
      
      Using tasklets for disk verification can reduce IO latency. When there
      are accelerated hash instructions it is often better to compute the
      hash immediately using a tasklet rather than deferring verification to
      a work-queue. This reduces time spent waiting to schedule work-queue
      jobs, but requires spending slightly more time in interrupt context.
      
      If the dm-bufio cache does not have the required hashes we fallback to
      the work-queue implementation. FEC is only possible using work-queue
      because code to support the FEC feature may sleep.
      
      The following shows a speed comparison of random reads on a dm-verity
      device. The dm-verity device uses a 1G ramdisk for data and a 1G
      ramdisk for hashes. One test was run using tasklets and one test was
      run using the existing work-queue solution. Both tests were run when
      the dm-bufio cache was hot. The tasklet implementation performs
      significantly better since there is no time spent waiting for
      work-queue jobs to be scheduled.
      
         READ: bw=181MiB/s (190MB/s), 181MiB/s-181MiB/s (190MB/s-190MB/s),
         io=512MiB (537MB), run=2827-2827msec
         READ: bw=23.6MiB/s (24.8MB/s), 23.6MiB/s-23.6MiB/s (24.8MB/s-24.8MB/s),
         io=512MiB (537MB), run=21688-21688msec
      
      Signed-off-by: default avatarNathan Huckleberry <nhuck@google.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@kernel.org>
      5721d4e5
  2. Jul 29, 2022
    • Nathan Huckleberry's avatar
      dm bufio: Add DM_BUFIO_CLIENT_NO_SLEEP flag · b32d4582
      Nathan Huckleberry authored
      
      
      Add an optional flag that ensures dm_bufio_client does not sleep
      (primary focus is to service dm_bufio_get without sleeping). This
      allows the dm-bufio cache to be queried from interrupt context.
      
      To ensure that dm-bufio does not sleep, dm-bufio must use a spinlock
      instead of a mutex. Additionally, to avoid deadlocks, special care
      must be taken so that dm-bufio does not sleep while holding the
      spinlock.
      
      But again: the scope of this no_sleep is initially confined to
      dm_bufio_get, so __alloc_buffer_wait_no_callback is _not_ changed to
      avoid sleeping because __bufio_new avoids allocation for NF_GET.
      
      Signed-off-by: default avatarNathan Huckleberry <nhuck@google.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@kernel.org>
      b32d4582
    • Nathan Huckleberry's avatar
      dm bufio: Add flags argument to dm_bufio_client_create · 0fcb100d
      Nathan Huckleberry authored
      
      
      Add a flags argument to dm_bufio_client_create and update all the
      callers. This is in preparation to add the DM_BUFIO_NO_SLEEP flag.
      
      Signed-off-by: default avatarNathan Huckleberry <nhuck@google.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@kernel.org>
      0fcb100d
    • Mike Snitzer's avatar
      dm: fix dm-raid crash if md_handle_request() splits bio · 9dd1cd32
      Mike Snitzer authored
      Commit ca522482 ("dm: pass NULL bdev to bio_alloc_clone")
      introduced the optimization to _not_ perform bio_associate_blkg()'s
      relatively costly work when DM core clones its bio. But in doing so it
      exposed the possibility for DM's cloned bio to alter DM target
      behavior (e.g. crash) if a target were to issue IO without first
      calling bio_set_dev().
      
      The DM raid target can trigger an MD crash due to its need to split
      the DM bio that is passed to md_handle_request(). The split will
      recurse to submit_bio_noacct() using a bio with an uninitialized
      ->bi_blkg. This NULL bio->bi_blkg causes blk_throtl_bio() to
      dereference a NULL blkg_to_tg(bio->bi_blkg).
      
      Fix this in DM core by adding a new 'needs_bio_set_dev' target flag that
      will make alloc_tio() call bio_set_dev() on behalf of the target.
      dm-raid is the only target that requires this flag. bio_set_dev()
      initializes the DM cloned bio's ->bi_blkg, using bio_associate_blkg,
      before passing the bio to md_handle_request().
      
      Long-term fix would be to audit and refactor MD code to rely on DM to
      split its bio, using dm_accept_partial_bio(), but there are MD raid
      personalities (e.g. raid1 and raid10) whose implementation are tightly
      coupled to handling the bio splitting inline.
      
      Fixes: ca522482
      
       ("dm: pass NULL bdev to bio_alloc_clone")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarMike Snitzer <snitzer@kernel.org>
      9dd1cd32
    • Mikulas Patocka's avatar
      dm raid: fix address sanitizer warning in raid_resume · 7dad24db
      Mikulas Patocka authored
      
      
      There is a KASAN warning in raid_resume when running the lvm test
      lvconvert-raid.sh. The reason for the warning is that mddev->raid_disks
      is greater than rs->raid_disks, so the loop touches one entry beyond
      the allocated length.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@kernel.org>
      7dad24db
    • Mikulas Patocka's avatar
      dm raid: fix address sanitizer warning in raid_status · 1fbeea21
      Mikulas Patocka authored
      There is this warning when using a kernel with the address sanitizer
      and running this testsuite:
      https://gitlab.com/cki-project/kernel-tests/-/tree/main/storage/swraid/scsi_raid
      
      
      
      ==================================================================
      BUG: KASAN: slab-out-of-bounds in raid_status+0x1747/0x2820 [dm_raid]
      Read of size 4 at addr ffff888079d2c7e8 by task lvcreate/13319
      CPU: 0 PID: 13319 Comm: lvcreate Not tainted 5.18.0-0.rc3.<snip> #1
      Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
      Call Trace:
       <TASK>
       dump_stack_lvl+0x6a/0x9c
       print_address_description.constprop.0+0x1f/0x1e0
       print_report.cold+0x55/0x244
       kasan_report+0xc9/0x100
       raid_status+0x1747/0x2820 [dm_raid]
       dm_ima_measure_on_table_load+0x4b8/0xca0 [dm_mod]
       table_load+0x35c/0x630 [dm_mod]
       ctl_ioctl+0x411/0x630 [dm_mod]
       dm_ctl_ioctl+0xa/0x10 [dm_mod]
       __x64_sys_ioctl+0x12a/0x1a0
       do_syscall_64+0x5b/0x80
      
      The warning is caused by reading conf->max_nr_stripes in raid_status. The
      code in raid_status reads mddev->private, casts it to struct r5conf and
      reads the entry max_nr_stripes.
      
      However, if we have different raid type than 4/5/6, mddev->private
      doesn't point to struct r5conf; it may point to struct r0conf, struct
      r1conf, struct r10conf or struct mpconf. If we cast a pointer to one
      of these structs to struct r5conf, we will be reading invalid memory
      and KASAN warns about it.
      
      Fix this bug by reading struct r5conf only if raid type is 4, 5 or 6.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@kernel.org>
      1fbeea21
    • Mike Christie's avatar
      dm: Start pr_preempt from the same starting path · c6adada5
      Mike Christie authored
      
      
      pr_preempt has a similar issue as reserve where for all the
      reservation types except the All Registrants ones the preempt can
      create a reservation. And a follow up reservation or release needs to
      go down the same path the preempt did. This has the pr_preempt work
      like reserve and release where we always start from the first path in
      the first group.
      
      This commit has been tested with windows failover clustering's
      validation test and libiscsi's PGR tests to check for regressions.
      They both don't have tests to verify this case, so I tested it
      manually.
      
      Signed-off-by: default avatarMike Christie <michael.christie@oracle.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@kernel.org>
      c6adada5
    • Mike Christie's avatar
      dm: Fix PR release handling for non All Registrants · 08a3c338
      Mike Christie authored
      
      
      This commit fixes a bug where we are leaving the reservation in place
      even though pr_release has run and returned success.
      
      If we have a Write Exclusive, Exclusive Access, or Write/Exclusive
      Registrants only reservation, the release must be sent down the path
      that is the reservation holder. The problem is multipath_prepare_ioctl
      most likely selected path N for the reservation, then later when we do
      the release multipath_prepare_ioctl will select a completely different
      path. The device will then return success becuase the nvme and scsi
      specs say to return success if there is no reservation or if the
      release is sent down from a path that is not the holder. We then think
      we have released the reservation.
      
      This commit has us loop over each path and send a release so we can
      make sure the release is executed on the correct path. It has been
      tested with windows failover clustering's validation test which checks
      this case, and it has been tested manually (the libiscsi PGR tests
      don't have a test case for this yet, but I will be adding one).
      
      Signed-off-by: default avatarMike Christie <michael.christie@oracle.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@kernel.org>
      08a3c338
    • Mike Christie's avatar
      dm: Start pr_reserve from the same starting path · 70151087
      Mike Christie authored
      
      
      When an app does a pr_reserve it will go to whatever path we happen to
      be using at the time. This can result in errors when the app does a
      second pr_reserve call and expects success but gets a failure because
      the reserve is not done on the holder's path. This commit has us
      always start trying to do reserves from the first path in the first
      group.
      
      Windows failover clustering will produce the type of pattern above.
      With this commit, we will now pass its validation test for this case.
      
      Signed-off-by: default avatarMike Christie <michael.christie@oracle.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@kernel.org>
      70151087
    • Mike Christie's avatar
      dm: Allow dm_call_pr to be used for path searches · 8dd87f3c
      Mike Christie authored
      
      
      The specs state that if you send a reserve down a path that is already
      the holder success must be returned and if it goes down a path that
      is not the holder reservation conflict must be returned. Windows
      failover clustering will send a second reservation and expects that a
      device returns success. The problem for multipathing is that for an
      All Registrants reservation, we can send the reserve down any path but
      for all other reservation types there is one path that is the holder.
      
      To handle this we could add PR state to dm but that can get nasty.
      Look at target_core_pr.c for an example of the type of things we'd
      have to track. It will also get more complicated because other
      initiators can change the state so we will have to add in async
      event/sense handling.
      
      This commit, and the 3 commits that follow, tries to keep dm simple
      and keep just doing passthrough. This commit modifies dm_call_pr to be
      able to find the first usable path that can execute our pr_op then
      return. When dm_pr_reserve is converted to dm_call_pr in the next
      commit for the normal case we will use the same path for every
      reserve.
      
      Signed-off-by: default avatarMike Christie <michael.christie@oracle.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@kernel.org>
      8dd87f3c
    • Mike Snitzer's avatar
      dm: return early from dm_pr_call() if DM device is suspended · e120a5f1
      Mike Snitzer authored
      Otherwise PR ops may be issued while the broader DM device is being
      reconfigured, etc.
      
      Fixes: 9c72bad1
      
       ("dm: call PR reserve/unreserve on each underlying device")
      Signed-off-by: default avatarMike Snitzer <snitzer@kernel.org>
      e120a5f1
  3. Jul 16, 2022
    • Luo Meng's avatar
      dm thin: fix use-after-free crash in dm_sm_register_threshold_callback · 3534e5a5
      Luo Meng authored
      Fault inject on pool metadata device reports:
        BUG: KASAN: use-after-free in dm_pool_register_metadata_threshold+0x40/0x80
        Read of size 8 at addr ffff8881b9d50068 by task dmsetup/950
      
        CPU: 7 PID: 950 Comm: dmsetup Tainted: G        W         5.19.0-rc6 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
        Call Trace:
         <TASK>
         dump_stack_lvl+0x34/0x44
         print_address_description.constprop.0.cold+0xeb/0x3f4
         kasan_report.cold+0xe6/0x147
         dm_pool_register_metadata_threshold+0x40/0x80
         pool_ctr+0xa0a/0x1150
         dm_table_add_target+0x2c8/0x640
         table_load+0x1fd/0x430
         ctl_ioctl+0x2c4/0x5a0
         dm_ctl_ioctl+0xa/0x10
         __x64_sys_ioctl+0xb3/0xd0
         do_syscall_64+0x35/0x80
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      This can be easily reproduced using:
        echo offline > /sys/block/sda/device/state
        dd if=/dev/zero of=/dev/mapper/thin bs=4k count=10
        dmsetup load pool --table "0 20971520 thin-pool /dev/sda /dev/sdb 128 0 0"
      
      If a metadata commit fails, the transaction will be aborted and the
      metadata space maps will be destroyed. If a DM table reload then
      happens for this failed thin-pool, a use-after-free will occur in
      dm_sm_register_threshold_callback (called from
      dm_pool_register_metadata_threshold).
      
      Fix this by in dm_pool_register_metadata_threshold() by returning the
      -EINVAL error if the thin-pool is in fail mode. Also fail pool_ctr()
      with a new error message: "Error registering metadata threshold".
      
      Fixes: ac8c3f3d
      
       ("dm thin: generate event when metadata threshold passed")
      Cc: stable@vger.kernel.org
      Reported-by: default avatarHulk Robot <hulkci@huawei.com>
      Signed-off-by: default avatarLuo Meng <luomeng12@huawei.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@kernel.org>
      3534e5a5
  4. Jul 15, 2022
  5. Jul 08, 2022
  6. Jul 07, 2022
    • Zhang Jiaming's avatar
      962c6296
    • Jiang Jian's avatar
    • Steven Lung's avatar
      dm cache: fix typo in 2 comment blocks · 5c29e784
      Steven Lung authored
      
      
      Replace neccessarily with necessarily.
      
      Signed-off-by: default avatarSteven Lung <1030steven@gmail.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@kernel.org>
      5c29e784
    • JeongHyeon Lee's avatar
      dm verity: fix checkpatch close brace error · 20e6fc85
      JeongHyeon Lee authored
      
      
      Resolves: ERROR: else should follow close brace '}'
      
      Signed-off-by: default avatarJeongHyeon Lee <jhs2.lee@samsung.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@kernel.org>
      20e6fc85
    • Mike Snitzer's avatar
      dm table: rename dm_target variable in dm_table_add_target() · 899ab445
      Mike Snitzer authored
      
      
      Rename from "tgt" to "ti" so that all of dm-table.c code uses the same
      naming for dm_target variables.
      
      Signed-off-by: default avatarMike Snitzer <snitzer@kernel.org>
      899ab445
    • Mike Snitzer's avatar
      dm table: audit all dm_table_get_target() callers · 564b5c54
      Mike Snitzer authored
      
      
      All callers of dm_table_get_target() are expected to do proper bounds
      checking on the index they pass.
      
      Move dm_table_get_target() to dm-core.h to make it extra clear that only
      DM core code should be using it. Switch it to be inlined while at it.
      
      Standardize all DM core callers to use the same for loop pattern and
      make associated variables as local as possible. Rename some variables
      (e.g. s/table/t/ and s/tgt/ti/) along the way.
      
      Signed-off-by: default avatarMike Snitzer <snitzer@kernel.org>
      564b5c54
    • Mike Snitzer's avatar
      dm table: remove dm_table_get_num_targets() wrapper · 2aec377a
      Mike Snitzer authored
      
      
      More efficient and readable to just access table->num_targets directly.
      
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarMike Snitzer <snitzer@kernel.org>
      2aec377a
    • Ming Lei's avatar
      dm: add two stage requeue mechanism · 8b211aac
      Ming Lei authored
      Commit 61b6e2e5 ("dm: fix BLK_STS_DM_REQUEUE handling when dm_io
      represents split bio") reverted DM core's bio splitting back to using
      bio_split()+bio_chain() because it was found that otherwise DM's
      BLK_STS_DM_REQUEUE would trigger a live-lock waiting for bio
      completion that would never occur.
      
      Restore using bio_trim()+bio_inc_remaining(), like was done in commit
      7dd76d1f ("dm: improve bio splitting and associated IO
      accounting"), but this time with proper handling for the above
      scenario that is covered in more detail in the commit header for
      61b6e2e5
      
      .
      
      Solve this issue by adding a two staged dm_io requeue mechanism that
      uses the new dm_bio_rewind() via dm_io_rewind():
      
      1) requeue the dm_io into the requeue_list added to struct
         mapped_device, and schedule it via new added requeue work. This
         workqueue just clones the dm_io->orig_bio (which DM saves and
         ensures its end sector isn't modified). dm_io_rewind() uses the
         sectors and sectors_offset members of the dm_io that are recorded
         relative to the end of orig_bio: dm_bio_rewind()+bio_trim() are
         then used to make that cloned bio reflect the subset of the
         original bio that is represented by the dm_io that is being
         requeued.
      
      2) the 2nd stage requeue is same with original requeue, but
         io->orig_bio points to new cloned bio (which matches the requeued
         dm_io as described above).
      
      This allows DM core to shift the need for bio cloning from bio-split
      time (during IO submission) to the less likely BLK_STS_DM_REQUEUE
      handling (after IO completes with that error).
      
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@kernel.org>
      8b211aac
    • Ming Lei's avatar
      dm: add dm_bio_rewind() API to DM core · 61cbe788
      Ming Lei authored
      Commit 7759eb23 ("block: remove bio_rewind_iter()") removed
      a similar API for the following reasons:
          ```
          It is pointed that bio_rewind_iter() is one very bad API[1]:
      
          1) bio size may not be restored after rewinding
      
          2) it causes some bogus change, such as 5151842b (block: reset
          bi_iter.bi_done after splitting bio)
      
          3) rewinding really makes things complicated wrt. bio splitting
      
          4) unnecessary updating of .bi_done in fast path
      
          [1] https://marc.info/?t=153549924200005&r=1&w=2
      
          So this patch takes Kent's suggestion to restore one bio into its original
          state via saving bio iterator(struct bvec_iter) in bio_integrity_prep(),
          given now bio_rewind_iter() is only used by bio integrity code.
          ```
      However, saving off a copy of the 32 bytes bio->bi_iter in case rewind
      needed isn't efficient because it bloats per-bio-data for what is an
      unlikely case. That suggestion also ignores the need to restore
      crypto and integrity info.
      
      Add dm_bio_rewind() API for a specific use-case that is much more narrow
      than the previous more generic rewind code that was reverted:
      
      1) most bios have a fixed end sector since bio split is done from front
         of the bio, if driver just records how many sectors between current
         bio's start sector and the original bio's end sector, the original
         position can be restored. Keeping the original bio's end sector
         fixed is a _hard_ requirement for this interface!
      
      2) if a bio's end sector won't change (usually bio_trim() isn't
         called, or in the case of DM it preserves original bio), user can
         restore the original position by storing sector offset from the
         current ->bi_iter.bi_sector to bio's end sector; together with
         saving bio size, only 8 bytes is needed to restore to original
         bio.
      
      3) DM's requeue use case: when BLK_STS_DM_REQUEUE happens, DM core
         needs to restore to an "original bio" which represents the current
         dm_io to be requeued (which may be a subset of the original bio).
         By storing the sector offset from the original bio's end sector and
         dm_io's size, dm_bio_rewind() can restore such original bio. See
         commit 7dd76d1f
      
       ("dm: improve bio splitting and associated IO
         accounting") for more details on how DM does this. Leveraging this,
         allows DM core to shift the need for bio cloning from bio-split
         time (during IO submission) to the less likely BLK_STS_DM_REQUEUE
         handling (after IO completes with that error).
      
      4) Unlike the original rewind API, dm_bio_rewind() doesn't add .bi_done
         to bvec_iter and there is no effect on the fast path.
      
      Implement dm_bio_rewind() by factoring out clear helpers that it calls:
      dm_bio_integrity_rewind, dm_bio_crypt_rewind and dm_bio_rewind_iter.
      
      DM is able to ensure that dm_bio_rewind() is used safely but, given
      the constraint that the bio's end must never change, other
      hypothetical future callers may not take the same care. So make
      dm_bio_rewind() and all supporting code local to DM to avoid risk of
      hypothetical abuse. A "dm_" prefix was added to all functions to avoid
      any namespace collisions.
      
      Suggested-by: default avatarJens Axboe <axboe@kernel.dk>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@kernel.org>
      61cbe788
  7. Jun 30, 2022
  8. Jun 29, 2022