  1. Apr 27, 2021
    • Merge branch 'md-next' of... · 8324fbae
      Jens Axboe authored
      Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.13/drivers
      
      Pull MD related fix from Song:
      
      "This change fixes raid5 on POWER8."
      
      * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        async_xor: increase src_offs when dropping destination page
    • async_xor: increase src_offs when dropping destination page · ceaf2966
      Xiao Ni authored
      Now we support sharing one page if PAGE_SIZE is not equal to the stripe
      size. To support this, the xor value must be calculated with a different
      offset for each r5dev. An offset array is used to record those offsets.
      
      In RMW mode, the parity page is used as a source page. ops_run_prexor5
      sets ASYNC_TX_XOR_DROP_DST before calculating the xor value, so the
      destination page must be dropped from src_list and src_offs at the same
      time. Currently only src_list is advanced, so the xor value is calculated
      with the wrong offsets, which can cause data corruption.
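
      The effect of advancing both arrays can be illustrated with a small
      userspace model. This is only a sketch, not the kernel code: xor_offs,
      its parameters and the byte buffers are hypothetical stand-ins for the
      page-based xor path, and drop_dst plays the role of
      ASYNC_TX_XOR_DROP_DST.

      ```c
      #include <assert.h>
      #include <stddef.h>

      /*
       * Hypothetical model: XOR `len` bytes of every source, each read at its
       * own offset, into dest + dest_off. When drop_dst is set, src_list[0] is
       * the destination itself and must be skipped.
       */
      static void xor_offs(unsigned char *dest, size_t dest_off,
                           unsigned char **src_list, const size_t *src_offs,
                           int src_cnt, size_t len, int drop_dst)
      {
          assert(src_cnt > 0);

          if (drop_dst) {
              /* Skip the destination entry in BOTH arrays so that
               * src_list[i] and src_offs[i] keep describing the same page. */
              src_cnt--;
              src_list++;
              src_offs++;
          }

          for (int i = 0; i < src_cnt; i++)
              for (size_t b = 0; b < len; b++)
                  dest[dest_off + b] ^= src_list[i][src_offs[i] + b];
      }
      ```

      If only src_list were advanced in this model, the data buffer would be
      read at the offset that belongs to the parity buffer, which is the kind
      of mismatch the commit describes.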
      
      I can reproduce this problem 100% on a POWER8 machine. The steps are:
      
        mdadm -CR /dev/md0 -l5 -n3 /dev/sdb1 /dev/sdc1 /dev/sdd1 --size=3G
        mkfs.xfs /dev/md0
        mount /dev/md0 /mnt/test
        mount: /mnt/test: mount(2) system call failed: Structure needs cleaning.
      
      Fixes: 29bcff78 ("md/raid5: add new xor function to support different page offset")
      Cc: stable@vger.kernel.org # v5.10+
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <song@kernel.org>
  2. Apr 26, 2021
    • drivers/block/null_blk/main: Fix a double free in null_init. · 72ce11dd
      Lv Yunlong authored
      In null_init, null_add_dev(dev) is called.
      In null_add_dev, it calls null_free_zoned_dev(dev) to free dev->zones
      via kvfree(dev->zones) in out_cleanup_zone branch and returns err.
      null_init then receives the error code and calls null_free_dev(dev).
      
      But in null_free_dev(dev), dev->zones is freed again by
      null_free_zoned_dev().
      
      This patch sets dev->zones to NULL in null_free_zoned_dev() after
      kvfree(dev->zones) is called, to avoid the double free.
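
      The pattern can be sketched in userspace as follows. This is an
      illustration only: struct fake_dev and free_zoned_dev are hypothetical
      stand-ins for the driver's structures, and free() stands in for
      kvfree(). Like kvfree(), free() ignores a NULL pointer, which is what
      makes the second call harmless.

      ```c
      #include <assert.h>
      #include <stdlib.h>

      /* Hypothetical stand-in for the device structure. */
      struct fake_dev {
          int *zones;
      };

      static void free_zoned_dev(struct fake_dev *dev)
      {
          assert(dev != NULL);
          free(dev->zones);   /* free(NULL) is a no-op */
          dev->zones = NULL;  /* a repeated call can no longer free the block twice */
      }
      ```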
      
      Fixes: 2984c868 ("nullb: factor disk parameters")
      Signed-off-by: Lv Yunlong <lyl2019@mail.ustc.edu.cn>
      Link: https://lore.kernel.org/r/20210426143229.7374-1-lyl2019@mail.ustc.edu.cn
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. Apr 24, 2021
    • Merge branch 'md-next' of... · b8417f72
      Jens Axboe authored
      Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-5.13/drivers
      
      Pull MD fixes from Song.
      
      * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md/raid1: properly indicate failure when ending a failed write request
        md-cluster: fix use-after-free issue when removing rdev
    • md/raid1: properly indicate failure when ending a failed write request · 2417b986
      Paul Clements authored
      This patch addresses a data corruption bug in raid1 arrays using bitmaps.
      Without this fix, the bitmap bits for the failed I/O end up being cleared.
      
      Since we are in the failure leg of raid1_end_write_request, the request
      either needs to be retried (R1BIO_WriteError) or failed (R1BIO_Degraded).
      
      Fixes: eeba6809 ("md/raid1: end bio when the device faulty")
      Cc: stable@vger.kernel.org # v5.2+
      Signed-off-by: Paul Clements <paul.clements@us.sios.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md-cluster: fix use-after-free issue when removing rdev · f7c7a2f9
      Heming Zhao authored
      md_kick_rdev_from_array will remove rdev from the list, so we should
      use rdev_for_each_safe to walk the list.
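
      The idea behind a "safe" iterator can be sketched with a plain singly
      linked list. This is a userspace illustration, not the md code: struct
      node, push and remove_matching are hypothetical names. The point is that
      the successor is saved before the current element may be freed.

      ```c
      #include <assert.h>
      #include <stdlib.h>

      struct node {
          int id;
          struct node *next;
      };

      /* Prepend a new node and return the new head. */
      static struct node *push(struct node *head, int id)
      {
          struct node *n = malloc(sizeof(*n));

          assert(n != NULL);
          n->id = id;
          n->next = head;
          return n;
      }

      /* Free every node whose id matches and return the new head. */
      static struct node *remove_matching(struct node *head, int id)
      {
          struct node **link = &head;
          struct node *cur, *tmp;

          for (cur = head; cur; cur = tmp) {
              tmp = cur->next;        /* saved before cur may be freed */
              if (cur->id == id) {
                  *link = tmp;        /* unlink cur from the list */
                  free(cur);
              } else {
                  link = &cur->next;
              }
          }
          return head;
      }
      ```

      Advancing with cur = cur->next after free(cur) would read freed memory,
      which is the class of bug this commit addresses.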
      
      How to trigger:
      
      env: Two nodes on kvm-qemu x86_64 VMs (2C2G with 2 iscsi luns).
      
      ```
      node2=192.168.0.3
      
      for i in {1..20}; do
          echo ==== $i `date` ====;
      
          mdadm -Ss && ssh ${node2} "mdadm -Ss"
          wipefs -a /dev/sda /dev/sdb
      
          mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l 1 /dev/sda \
             /dev/sdb --assume-clean
          ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
          mdadm --wait /dev/md0
          ssh ${node2} "mdadm --wait /dev/md0"
      
          mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda
          sleep 1
      done
      ```
      
      Crash stack:
      
      ```
      stack segment: 0000 [#1] SMP
      ... ...
      RIP: 0010:md_check_recovery+0x1e8/0x570 [md_mod]
      ... ...
      RSP: 0018:ffffb149807a7d68 EFLAGS: 00010207
      RAX: 0000000000000000 RBX: ffff9d494c180800 RCX: ffff9d490fc01e50
      RDX: fffff047c0ed8308 RSI: 0000000000000246 RDI: 0000000000000246
      RBP: 6b6b6b6b6b6b6b6b R08: ffff9d490fc01e40 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
      R13: ffff9d494c180818 R14: ffff9d493399ef38 R15: ffff9d4933a1d800
      FS:  0000000000000000(0000) GS:ffff9d494f700000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fe68cab9010 CR3: 000000004c6be001 CR4: 00000000003706e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       raid1d+0x5c/0xd40 [raid1]
       ? finish_task_switch+0x75/0x2a0
       ? lock_timer_base+0x67/0x80
       ? try_to_del_timer_sync+0x4d/0x80
       ? del_timer_sync+0x41/0x50
       ? schedule_timeout+0x254/0x2d0
       ? md_start_sync+0xe0/0xe0 [md_mod]
       ? md_thread+0x127/0x160 [md_mod]
       md_thread+0x127/0x160 [md_mod]
       ? wait_woken+0x80/0x80
       kthread+0x10d/0x130
       ? kthread_park+0xa0/0xa0
       ret_from_fork+0x1f/0x40
      ```
      
      Fixes: dbb64f86 ("md-cluster: Fix adding of new disk with new reload code")
      Fixes: 659b254f ("md-cluster: remove a disk asynchronously from cluster environment")
      Cc: stable@vger.kernel.org
      Reviewed-by: Gang He <ghe@suse.com>
      Signed-off-by: Heming Zhao <heming.zhao@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
  4. Apr 23, 2021
    • Merge tag 'nvme-5.13-2021-04-22' of git://git.infradead.org/nvme into for-5.13/drivers · 87d9ad02
      Jens Axboe authored
      Pull NVMe updates from Christoph:
      
      "- add support for a per-namespace character device (Minwoo Im)
       - various KATO fixes and cleanups (Hou Pu, Hannes Reinecke)
       - APST fix and cleanup"
      
      * tag 'nvme-5.13-2021-04-22' of git://git.infradead.org/nvme:
        nvme: introduce generic per-namespace chardev
        nvme: cleanup nvme_configure_apst
        nvme: do not try to reconfigure APST when the controller is not live
        nvme: add 'kato' sysfs attribute
        nvme: sanitize KATO setting
        nvmet: avoid queuing keep-alive timer if it is disabled
  5. Apr 22, 2021
    • nvme: introduce generic per-namespace chardev · 2637baed
      Minwoo Im authored
      Userspace has not been allowed to issue I/O to a device that failed to
      be initialized.  This patch introduces a generic per-namespace character
      device that lets userspace issue I/O regardless of whether the block
      device is there.
      
      The chardev naming convention will be similar to the existing blkdev
      naming, using an ng prefix instead of nvme, i.e.
      
      	- /dev/ngXnY
      
      It also supports multipath, which means it will not expose a chardev for
      the hidden namespace blkdevs (e.g., nvmeXcYnZ).  If /dev/ngXnY is created
      for a ns_head, then I/O requests will be routed to a specific controller
      selected by the iopolicy of the subsystem.
      
      Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
      Signed-off-by: Javier González <javier.gonz@samsung.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Tested-by: Kanchan Joshi <joshi.k@samsung.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: cleanup nvme_configure_apst · 60df5de9
      Christoph Hellwig authored
      Remove a level of indentation from the main code implementing the table
      search by using a goto for the APST not supported case.  Also move the
      main comment above the function.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
    • nvme: do not try to reconfigure APST when the controller is not live · 53fe2a30
      Christoph Hellwig authored
      Do not call nvme_configure_apst when the controller is not live, given
      that nvme_configure_apst will fail due to the lack of an admin queue when
      the controller is being torn down and nvme_set_latency_tolerance is
      called from dev_pm_qos_hide_latency_tolerance.
      
      Fixes: 510a405d ("nvme: fix memory leak for power latency tolerance")
      Reported-by: Peng Liu <liupeng17@lenovo.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
    • nvme: add 'kato' sysfs attribute · 74c22990
      Hannes Reinecke authored
      Add a 'kato' controller sysfs attribute to display the current
      keep-alive timeout value (if any). This allows userspace to identify
      persistent discovery controllers, as these will have a non-zero
      KATO value.
      
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: sanitize KATO setting · a70b81bd
      Hannes Reinecke authored
      According to the NVMe base spec the KATO commands should be sent
      at half of the KATO interval, to properly account for round-trip
      times.
      As we now will only ever send one KATO command per connection we
      can easily use the recommended values.
      This also fixes a potential issue where the request timeout for
      the KATO command does not match the value in the connect command,
      which might be causing spurious connection drops from the target.
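
      As a worked example of the half-interval rule, the delay before the next
      keep-alive command could be computed as below. This is a sketch with
      hypothetical names (keep_alive_delay_ms, kato_s); it assumes KATO is
      held in seconds and says nothing about how the driver schedules the
      work.

      ```c
      #include <assert.h>
      #include <limits.h>

      /*
       * Hypothetical helper: kato_s is the keep-alive timeout in seconds.
       * Returns the delay, in milliseconds, after which the next keep-alive
       * command should be sent: half of the KATO interval.
       */
      static unsigned long keep_alive_delay_ms(unsigned int kato_s)
      {
          /* Guard the multiplication against overflow on narrow longs. */
          assert(kato_s <= ULONG_MAX / 1000UL);
          return (unsigned long)kato_s * 1000UL / 2;
      }
      ```

      With a 5 second KATO this gives 2500 ms, leaving the other half of the
      interval for round-trip time.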
      
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvmet: avoid queuing keep-alive timer if it is disabled · 8f864c59
      Hou Pu authored
      Issuing the following commands:
      nvme set-feature -f 0xf -v 0 /dev/nvme1n1 # disable keep-alive timer
      nvme admin-passthru -o 0x18 /dev/nvme1n1  # send keep-alive command
      will make the keep-alive timer fire and thus delete the controller, as
      shown below:
      
      [247459.907635] nvmet: ctrl 1 keep-alive timer (0 seconds) expired!
      [247459.930294] nvmet: ctrl 1 fatal error occurred!
      
      Avoid this by not queuing the delayed keep-alive work if keep-alive is
      disabled when a keep-alive command is received from the admin queue.
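
      The guard can be modelled in a few lines. This is a sketch only: struct
      fake_ctrl and rearm_keep_alive are hypothetical, and a boolean flag
      stands in for the delayed work that the target would queue.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      /* Hypothetical stand-in for the target controller state. */
      struct fake_ctrl {
          unsigned int kato;   /* keep-alive timeout; 0 means disabled */
          bool timer_armed;    /* models "delayed work is queued" */
      };

      /* Called when a keep-alive command arrives; returns true if re-armed. */
      static bool rearm_keep_alive(struct fake_ctrl *ctrl)
      {
          assert(ctrl != NULL);
          if (ctrl->kato == 0)
              return false;    /* disabled: never queue a 0-second timer */
          ctrl->timer_armed = true;
          return true;
      }
      ```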
      
      Signed-off-by: Hou Pu <houpu.main@gmail.com>
      Tested-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • brd: expose number of allocated pages in debugfs · f4be591f
      Calvin Owens authored
      While the maximum size of each ramdisk is defined either as a module
      parameter, or compile time default, it's impossible to know how many pages
      have currently been allocated by each ram%d device, since they're
      allocated when used and never freed.
      
      This patch creates a new directory at this location:
      
      /sys/kernel/debug/ramdisk_pages/
      
      which will contain a file named "ram%d" for each instantiated ramdisk on
      the system. The file is read-only, and read() will output the number of
      pages currently held by that ramdisk.
      
      We lose track of how much memory a ramdisk is using, as pages once used
      are simply recycled but never freed.
      
      In instances where we exhaust the size of the ramdisk with a file that
      exceeds it, encounter ENOSPC and delete the file for mitigation, df would
      show a decrease in used and an increase in available blocks, but since we
      have touched all pages, the memory footprint of the ramdisk does not
      reflect the blocks used/available count.
      
      ...
      [root@localhost ~]# mkfs.ext2 /dev/ram15
      mke2fs 1.45.6 (20-Mar-2020)
      Creating filesystem with 4096 1k blocks and 1024 inodes
      [root@localhost ~]# mount /dev/ram15 /mnt/ram15/
      
      [root@localhost ~]# cat /sys/kernel/debug/ramdisk_pages/ram15
      58
      [root@kerneltest008.06.prn3 ~]# df /dev/ram15
      Filesystem     1K-blocks  Used Available Use% Mounted on
      /dev/ram15          3963    31      3728   1% /mnt/ram15
      [root@kerneltest008.06.prn3 ~]# dd if=/dev/urandom of=/mnt/ram15/test2 bs=1M count=5
      dd: error writing '/mnt/ram15/test2': No space left on device
      4+0 records in
      3+0 records out
      4005888 bytes (4.0 MB, 3.8 MiB) copied, 0.0446614 s, 89.7 MB/s
      [root@kerneltest008.06.prn3 ~]# df /mnt/ram15/
      Filesystem     1K-blocks  Used Available Use% Mounted on
      /dev/ram15          3963  3960         0 100% /mnt/ram15
      [root@kerneltest008.06.prn3 ~]# cat /sys/kernel/debug/ramdisk_pages/ram15
      1024
      [root@kerneltest008.06.prn3 ~]# rm /mnt/ram15/test2
      rm: remove regular file '/mnt/ram15/test2'? y
      [root@kerneltest008.06.prn3 /var]# df /dev/ram15
      Filesystem     1K-blocks  Used Available Use% Mounted on
      /dev/ram15          3963    31      3728   1% /mnt/ram15
      
      # Actual memory footprint
      [root@kerneltest008.06.prn3 /var]# cat /sys/kernel/debug/ramdisk_pages/ram15
      1024
      ...
      
      This debugfs counter will always reveal the accurate number of
      permanently allocated pages to the ramdisk.
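
      To relate the counter to memory use, the page count can be multiplied by
      the page size. The helper below is a sketch with hypothetical names, and
      the 4096-byte page size used in the note after it is an assumption
      about the test machine, not something stated in this commit.

      ```c
      #include <assert.h>

      /* Hypothetical helper: bytes held by a ramdisk given its page count. */
      static unsigned long long ramdisk_bytes(unsigned long long pages,
                                              unsigned long page_size)
      {
          assert(page_size > 0);
          return pages * page_size;
      }
      ```

      Assuming 4096-byte pages, the 1024 pages reported above correspond to
      4194304 bytes (4 MiB), the full size of the ramdisk with 4096 1k blocks,
      even though df reports only 31 blocks in use after the file is removed.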
      
      Signed-off-by: Calvin Owens <calvinowens@fb.com>
      [cleaned up the !CONFIG_DEBUG_FS case and API changes for HEAD]
      Signed-off-by: Kyle McMartin <jkkm@fb.com>
      [rebased]
      Signed-off-by: Saravanan D <saravanand@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. Apr 21, 2021
  7. Apr 20, 2021