  Aug 18, 2023
    • Merge tag 'md-next-20230817' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.6/block · eb051b2d
      Jens Axboe authored
      
      Pull MD changes from Song:
      
      "1. Fix perf regression for raid0 large sequential writes, by Jan Kara.
       2. Fix split bio iostat for raid0, by David Jeffery.
       3. Various raid1 fixes, by Heinz Mauelshagen and Xueshi Hu."
      
      * tag 'md-next-20230817' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md: raid0: account for split bio in iostat accounting
        md/raid0: Fix performance regression for large sequential writes
        md/raid0: Factor out helper for mapping and submitting a bio
        md raid1: allow writebehind to work on any leg device set WriteMostly
        md/raid1: hold the barrier until handle_read_error() finishes
        md/raid1: free the r1bio before waiting for blocked rdev
        md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io()
    • md: raid0: account for split bio in iostat accounting · cc22b540
      David Jeffery authored
      When a bio is split by md raid0, the newly created bio is not tracked
      by md for I/O accounting. Only the portion of I/O still assigned to the
      original bio, which was reduced by the split, is accounted for. This
      results in md iostat data sometimes showing I/O values far below the
      actual amount of data being sent through md.
      
      md_account_bio() needs to be called for every bio generated by the bio split.
      
      A simple example of the issue was generated using a raid0 device built on
      partitions of the same disk. Since all raid0 I/O then goes to one device,
      it is easy to see the gap between the md device and its sd storage. Reading
      an lvm device on top of the md device, the iostat output (some all-zero
      columns and extra devices removed to make the data more compact) was:
      
      Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read
      md2               0.00         0.00         0.00         0.00          0
      sde               0.00         0.00         0.00         0.00          0
      md2            1364.00    411496.00         0.00         0.00     411496
      sde            1734.00    646144.00         0.00         0.00     646144
      md2            1699.00    510680.00         0.00         0.00     510680
      sde            2155.00    802784.00         0.00         0.00     802784
      md2             803.00    241480.00         0.00         0.00     241480
      sde            1016.00    377888.00         0.00         0.00     377888
      md2               0.00         0.00         0.00         0.00          0
      sde               0.00         0.00         0.00         0.00          0
      
      I/O was generated by doing large (12M) direct I/O reads with dd from a
      linear lvm volume on top of the 4-leg raid0 device.
      
      The md2 reads were showing as roughly 2/3 of the reads to the sde device
      containing all of md2's raid partitions. The sum of reads to sde was
      1826816 kB, which was the expected amount as it was the amount read by
      dd. With the patch, the total reads from md will match the reads from
      sde and be consistent with the amount of I/O generated.
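
      For illustration, here is a condensed sketch of the idea behind the fix
      (not the literal patch): do the accounting in the common map-and-submit
      path, so every bio raid0 actually submits, including split-off ones, is
      counted. It assumes the raid0_map_submit_bio() helper factored out by the
      preparatory patch below, and omits tracing and error handling.

        /* Sketch only: map one (possibly split-off) bio onto its member device
         * and submit it. Calling md_account_bio() here means each submitted
         * piece shows up in md's iostat, not just the original bio.
         */
        static void raid0_map_submit_bio(struct mddev *mddev, struct bio *bio)
        {
                struct r0conf *conf = mddev->private;
                sector_t bio_sector = bio->bi_iter.bi_sector;
                sector_t sector = bio_sector;
                struct strip_zone *zone;
                struct md_rdev *tmp_dev;

                md_account_bio(mddev, &bio);    /* clone bio for I/O accounting */

                zone = find_zone(conf, &sector);
                tmp_dev = map_sector(mddev, zone, bio_sector, &sector);

                bio_set_dev(bio, tmp_dev->bdev);
                bio->bi_iter.bi_sector = sector + zone->dev_start +
                        tmp_dev->data_offset;
                submit_bio_noacct(bio);
        }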
      
      Fixes: 10764815 ("md: add io accounting for raid0 and raid5")
      Signed-off-by: David Jeffery <djeffery@redhat.com>
      Tested-by: Laurence Oberman <loberman@redhat.com>
      Reviewed-by: Laurence Oberman <loberman@redhat.com>
      Reviewed-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230816181433.13289-1-djeffery@redhat.com
    • md/raid0: Fix performance regression for large sequential writes · 319ff40a
      Jan Kara authored
      Commit f00d7c85 ("md/raid0: fix up bio splitting.") among other
      things changed how a bio that needs to be split is submitted. Before this
      commit, we split the bio, then mapped and submitted each part. After this
      commit, we map only the first part of the split bio and submit the second
      part unmapped. Due to bio sorting in __submit_bio_noacct() this results in
      the following request ordering:
      
        9,0   18     1181     0.525037895 15995  Q  WS 1479315464 + 63392
      
        Split off chunk-sized (1024 sectors) request:
      
        9,0   18     1182     0.629019647 15995  X  WS 1479315464 / 1479316488
      
        Request is unaligned to the chunk so it's split in
        raid0_make_request().  This is the first part mapped and punted to
        bio_list:
      
        8,0   18     7053     0.629020455 15995  A  WS 739921928 + 1016 <- (9,0) 1479315464
      
        Now raid0_make_request() returns, second part is postponed on
        bio_list. __submit_bio_noacct() resorts the bio_list, mapped request
        is submitted to the underlying device:
      
        8,0   18     7054     0.629022782 15995  G  WS 739921928 + 1016
      
        Now we take another request from the bio_list which is the remainder
        of the original huge request. Split off another chunk-sized bit from
        it and the situation repeats:
      
        9,0   18     1183     0.629024499 15995  X  WS 1479316488 / 1479317512
        8,16  18     6998     0.629025110 15995  A  WS 739921928 + 1016 <- (9,0) 1479316488
        8,16  18     6999     0.629026728 15995  G  WS 739921928 + 1016
        ...
        9,0   18     1184     0.629032940 15995  X  WS 1479317512 / 1479318536 [libnetacq-write]
        8,0   18     7059     0.629033294 15995  A  WS 739922952 + 1016 <- (9,0) 1479317512
        8,0   18     7060     0.629033902 15995  G  WS 739922952 + 1016
        ...
      
        This repeats until we consume the whole original huge request. Now we
        finally get to processing the second parts of the split off requests
        (in reverse order):
      
        8,16  18     7181     0.629161384 15995  A  WS 739952640 + 8 <- (9,0) 1479377920
        8,0   18     7239     0.629162140 15995  A  WS 739952640 + 8 <- (9,0) 1479376896
        8,16  18     7186     0.629163881 15995  A  WS 739951616 + 8 <- (9,0) 1479375872
        8,0   18     7242     0.629164421 15995  A  WS 739951616 + 8 <- (9,0) 1479374848
        ...
      
      I guess it is obvious that this IO pattern is an extremely inefficient way
      to perform sequential IO. It also makes the bio_list grow to rather long
      lengths.
      
      Change raid0_make_request() to map both parts of the split bio. Since we
      know we are provided with at most chunk-sized bios, we will always need
      to split the incoming bio at most once.
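
      For illustration, a condensed sketch of the reworked raid0_make_request()
      (not the literal patch: flush/discard handling and the non-power-of-2
      chunk case are omitted; raid0_map_submit_bio() is the helper factored out
      by the preparatory patch below):

        static bool raid0_make_request(struct mddev *mddev, struct bio *bio)
        {
                sector_t sector = bio->bi_iter.bi_sector;
                unsigned int chunk_sects = mddev->chunk_sectors;
                unsigned int sectors = chunk_sects - (sector & (chunk_sects - 1));

                if (sectors < bio_sectors(bio)) {
                        /* The bio straddles a chunk boundary: split once... */
                        struct bio *split = bio_split(bio, sectors, GFP_NOIO,
                                                      &mddev->bio_set);
                        bio_chain(split, bio);
                        /* ...and map/submit the remainder immediately instead
                         * of handing it back to the block layer unmapped. */
                        raid0_map_submit_bio(mddev, bio);
                        bio = split;
                }

                raid0_map_submit_bio(mddev, bio);
                return true;
        }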
      
      Fixes: f00d7c85 ("md/raid0: fix up bio splitting.")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Yu Kuai <yukuai3@huawei.com>
      Link: https://lore.kernel.org/r/20230814092720.3931-2-jack@suse.cz
      Signed-off-by: Song Liu <song@kernel.org>
    • md/raid0: Factor out helper for mapping and submitting a bio · af50e20a
      Jan Kara authored
      
      
      Factor a helper function for mapping and submitting a bio out of
      raid0_make_request(). We will use it later for submitting both parts of
      a split bio.
      
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Yu Kuai <yukuai3@huawei.com>
      Link: https://lore.kernel.org/r/20230814092720.3931-1-jack@suse.cz
      Signed-off-by: Song Liu <song@kernel.org>
    • md raid1: allow writebehind to work on any leg device set WriteMostly · 6b2460e6
      Heinz Mauelshagen authored
      
      
      As the WriteMostly flag can be set on any component device of a RAID1
      array, remove the constraint that write-behind only works if the flag is
      set on the first one.
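
      For illustration, a sketch of the kind of check this allows in
      raid1_write_request() (illustrative only, not the literal patch; locking
      around the rdev pointers is elided):

        /* Consider write-behind if any leg is marked WriteMostly, instead of
         * looking only at conf->mirrors[0].
         */
        bool write_behind = false;
        int i;

        for (i = 0; i < conf->raid_disks * 2; i++) {
                struct md_rdev *rdev = conf->mirrors[i].rdev;

                if (rdev && test_bit(WriteMostly, &rdev->flags))
                        write_behind = true;
        }
        /* Only attempt alloc_behind_master_bio() later when write_behind is
         * set and the usual write-behind limits allow it. */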
      
      Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
      Tested-by: Xiao Ni <xni@redhat.com>
      Link: https://lore.kernel.org/r/2a9592bf3340f34bf588eec984b23ee219f3985e.1692013451.git.heinzm@redhat.com
      Signed-off-by: Song Liu <song@kernel.org>
    • md/raid1: hold the barrier until handle_read_error() finishes · c069da44
      Xueshi Hu authored
      handle_read_error() calls allow_barrier() to match the barrier that was
      raised earlier. However, allow_barrier() should be called at the end of
      handle_read_error() to avoid racing with a concurrent raid reshape.
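
      For illustration, a sketch of the resulting tail of handle_read_error()
      (freeze_array()/fix_read_error() handling elided; not the literal patch):

        static void handle_read_error(struct r1conf *conf, struct r1bio *r1_bio)
        {
                struct mddev *mddev = conf->mddev;
                struct bio *bio = r1_bio->master_bio;
                sector_t sector = r1_bio->sector;

                /* ...read-error recovery elided... */

                /* Re-issue the read while the barrier raised for the original
                 * request is still held, and drop it only at the very end so a
                 * concurrent raid1_reshape() cannot run in between. */
                raid1_read_request(mddev, bio, r1_bio->sectors, r1_bio);
                allow_barrier(conf, sector);
        }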
      
      Fixes: 689389a0 ("md/raid1: simplify handle_read_error().")
      Reviewed-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
      Link: https://lore.kernel.org/r/20230814135356.1113639-4-xueshi.hu@smartx.com
      Signed-off-by: Song Liu <song@kernel.org>
    • md/raid1: free the r1bio before waiting for blocked rdev · 992db13a
      Xueshi Hu authored
      A raid1 reshape replaces the mempool and r1conf::raid_disks, both of which
      are needed to free an r1bio. allow_barrier() makes a concurrent
      raid1_reshape() possible. So, free the in-flight r1bio before waiting for
      the blocked rdev.
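
      For illustration, a condensed sketch of the retry path in
      raid1_write_request() (error handling condensed; not the literal patch):

        if (unlikely(blocked_rdev)) {
                /* ...drop the references taken on the legs picked so far... */
                free_r1bio(r1_bio);     /* free before opening the barrier */
                allow_barrier(conf, bio->bi_iter.bi_sector);
                md_wait_for_blocked_rdev(blocked_rdev, mddev);
                wait_barrier(conf, bio->bi_iter.bi_sector, false);
                goto retry_write;       /* the r1bio is re-created for the retry */
        }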
      
      Fixes: 6bfe0b49 ("md: support blocking writes to an array on device failure")
      Reviewed-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
      Link: https://lore.kernel.org/r/20230814135356.1113639-3-xueshi.hu@smartx.com
      Signed-off-by: Song Liu <song@kernel.org>
    • md/raid1: call free_r1bio() before allow_barrier() in raid_end_bio_io() · c5d736f5
      Xueshi Hu authored
      
      
      After allow_barrier(), a concurrent raid1_reshape() can replace the old
      mempool and r1conf::raid_disks. Move allow_barrier() to the end of
      raid_end_bio_io(), so that the r1bio can be freed safely.
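
      For illustration, a simplified sketch of the reordered function (not the
      literal patch):

        static void raid_end_bio_io(struct r1bio *r1_bio)
        {
                struct r1conf *conf = r1_bio->mddev->private;
                sector_t sector = r1_bio->sector;

                /* If nobody has completed the master bio yet, do it now. */
                if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state))
                        call_bio_endio(r1_bio);

                /* Free the r1bio while the mempool and conf->raid_disks it
                 * depends on are still stable, and only then drop the barrier
                 * so a concurrent raid1_reshape() may proceed. */
                free_r1bio(r1_bio);
                allow_barrier(conf, sector);
        }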
      
      Reviewed-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Xueshi Hu <xueshi.hu@smartx.com>
      Link: https://lore.kernel.org/r/20230814135356.1113639-2-xueshi.hu@smartx.com
      Signed-off-by: Song Liu <song@kernel.org>
    • blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init · ec14a87e
      Tejun Heo authored
      
      
      blk-iocost sometimes causes the following crash:
      
        BUG: kernel NULL pointer dereference, address: 00000000000000e0
        ...
        RIP: 0010:_raw_spin_lock+0x17/0x30
        Code: be 01 02 00 00 e8 79 38 39 ff 31 d2 89 d0 5d c3 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 65 ff 05 48 d0 34 7e b9 01 00 00 00 31 c0 <f0> 0f b1 0f 75 02 5d c3 89 c6 e8 ea 04 00 00 5d c3 0f 1f 84 00 00
        RSP: 0018:ffffc900023b3d40 EFLAGS: 00010046
        RAX: 0000000000000000 RBX: 00000000000000e0 RCX: 0000000000000001
        RDX: ffffc900023b3d20 RSI: ffffc900023b3cf0 RDI: 00000000000000e0
        RBP: ffffc900023b3d40 R08: ffffc900023b3c10 R09: 0000000000000003
        R10: 0000000000000064 R11: 000000000000000a R12: ffff888102337000
        R13: fffffffffffffff2 R14: ffff88810af408c8 R15: ffff8881070c3600
        FS:  00007faaaf364fc0(0000) GS:ffff88842fdc0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000000e0 CR3: 00000001097b1000 CR4: 0000000000350ea0
        Call Trace:
         <TASK>
         ioc_weight_write+0x13d/0x410
         cgroup_file_write+0x7a/0x130
         kernfs_fop_write_iter+0xf5/0x170
         vfs_write+0x298/0x370
         ksys_write+0x5f/0xb0
         __x64_sys_write+0x1b/0x20
         do_syscall_64+0x3d/0x80
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      This happens because iocg->ioc is NULL. The field is initialized by
      ioc_pd_init() and never cleared. The NULL deref is caused by
      blkcg_activate_policy() installing blkg_policy_data before initializing it.
      
      blkcg_activate_policy() was doing the following:
      
      1. Allocate pd's for all existing blkg's and install them in blkg->pd[].
      2. Initialize all pd's.
      3. Online all pd's.
      
      blkcg_activate_policy() only grabs the queue_lock and may release and
      re-acquire the lock as allocation may need to sleep. ioc_weight_write()
      grabs blkcg->lock and iterates all its blkg's. The two can race, and if
      ioc_weight_write() runs during #1 or between #1 and #2, it can encounter a
      pd which is not initialized yet, leading to a crash.
      
      The crash can be reproduced with the following script:
      
        #!/bin/bash
      
        echo +io > /sys/fs/cgroup/cgroup.subtree_control
        systemd-run --unit touch-sda --scope dd if=/dev/sda of=/dev/null bs=1M count=1 iflag=direct
        echo 100 > /sys/fs/cgroup/system.slice/io.weight
        bash -c "echo '8:0 enable=1' > /sys/fs/cgroup/io.cost.qos" &
        sleep .2
        echo 100 > /sys/fs/cgroup/system.slice/io.weight
      
      with the following patch applied:
      
      > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
      > index fc49be622e05..38d671d5e10c 100644
      > --- a/block/blk-cgroup.c
      > +++ b/block/blk-cgroup.c
      > @@ -1553,6 +1553,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol)
      > 		pd->online = false;
      > 	}
      >
      > +       if (system_state == SYSTEM_RUNNING) {
      > +               spin_unlock_irq(&q->queue_lock);
      > +               ssleep(1);
      > +               spin_lock_irq(&q->queue_lock);
      > +       }
      > +
      > 	/* all allocated, init in the same order */
      > 	if (pol->pd_init_fn)
      > 		list_for_each_entry_reverse(blkg, &q->blkg_list, q_node)
      
      I don't see a reason why all pd's should be allocated, initialized and
      onlined together. The only ordering requirement is that parent blkgs be
      initialized and onlined before children, which is guaranteed by the
      walking order. Let's fix the bug by allocating, initializing and onlining
      the pd for each blkg while holding blkcg->lock over initialization and
      onlining. This ensures that an installed pd is always fully initialized
      and onlined, removing the race window.
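
      For illustration, a sketch of the reworked loop in blkcg_activate_policy()
      (callback signatures are assumed to match recent kernels; the
      GFP_NOWAIT-then-GFP_KERNEL retry and the error unwinding are omitted):

        list_for_each_entry_reverse(blkg, &q->blkg_list, q_node) {
                struct blkg_policy_data *pd;

                pd = pol->pd_alloc_fn(disk, blkg->blkcg,
                                      GFP_NOWAIT | __GFP_NOWARN);
                if (!pd)
                        goto enomem;    /* drop locks, retry with GFP_KERNEL */

                /* Install, init and online in one step under blkcg->lock so a
                 * reader walking the blkg's under that lock never sees an
                 * installed but uninitialized pd. */
                spin_lock(&blkg->blkcg->lock);

                pd->blkg = blkg;
                pd->plid = pol->plid;
                blkg->pd[pol->plid] = pd;

                if (pol->pd_init_fn)
                        pol->pd_init_fn(pd);
                if (pol->pd_online_fn)
                        pol->pd_online_fn(pd);
                pd->online = true;

                spin_unlock(&blkg->blkcg->lock);
        }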
      
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Breno Leitao <leitao@debian.org>
      Fixes: 9d179b86 ("blkcg: Fix multiple bugs in blkcg_activate_policy()")
      Link: https://lore.kernel.org/r/ZN0p5_W-Q9mAHBVY@slm.duckdns.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>