  1. Dec 15, 2021
  2. Dec 14, 2021
    • iocost: Fix divide-by-zero on donation from low hweight cgroup · edaa2633
      Tejun Heo authored
      The donation calculation logic assumes that the donor has non-zero
      after-donation hweight, so the lowest active hweight a donating cgroup can
      have is 2 so that it can donate 1 while keeping the other 1 for itself.
      Earlier, we only donated from cgroups with sizable surpluses so this
      condition was always true. However, with the precise donation algorithm
      implemented, f1de2439 ("blk-iocost: revamp donation amount
      determination") made the donation amount calculation exact, enabling even low
      hweight cgroups to donate.
      
      This means that on rare occasions, a cgroup with an active hweight of 1 can
      enter the donation calculation, triggering the following warning and then a
      divide-by-zero oops.
      
       WARNING: CPU: 4 PID: 0 at block/blk-iocost.c:1928 transfer_surpluses.cold+0x0/0x53 [884/94867]
       ...
       RIP: 0010:transfer_surpluses.cold+0x0/0x53
       Code: 92 ff 48 c7 c7 28 d1 ab b5 65 48 8b 34 25 00 ae 01 00 48 81 c6 90 06 00 00 e8 8b 3f fe ff 48 c7 c0 ea ff ff ff e9 95...
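
The constraint described in this commit message can be illustrated with a minimal sketch. It is a toy model, not the code in block/blk-iocost.c: `can_donate` and `scale_for_donation` are hypothetical names, and the scaling formula is invented only to show a division by the after-donation hweight.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: a donor must be able to give away at least 1
 * while keeping at least 1 for itself, so it needs an active hweight >= 2.
 */
static bool can_donate(uint32_t hweight_active)
{
    return hweight_active >= 2;
}

/*
 * Hypothetical calculation (not the kernel's formula): scale a value by
 * the ratio of the active hweight to the after-donation hweight.
 * The divisor is the after-donation hweight, so it must never be zero.
 */
static uint64_t scale_for_donation(uint64_t value, uint32_t hweight_active,
                                   uint32_t donated)
{
    /* Reject the case that would leave the donor with nothing. */
    assert(can_donate(hweight_active));
    assert(donated >= 1 && donated < hweight_active);

    uint32_t hweight_after = hweight_active - donated; /* >= 1 here */
    return value * hweight_active / hweight_after;
}
```

With an active hweight of 1, any donation of 1 would leave an after-donation hweight of 0, which is the divisor in this sketch. Filtering such cgroups out before the calculation avoids that case.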
  3. Dec 11, 2021
  4. Dec 10, 2021
    • Merge tag 'nvme-5.16-2021-12-10' of git://git.infradead.org/nvme into block-5.16 · 091f06d9
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "nvme fixes for Linux 5.16
      
       - set ana_log_size to 0 after freeing ana_log_buf (Hou Tao)
       - show subsys nqn for duplicate cntlids (Keith Busch)
       - disable namespace access for unsupported metadata (Keith Busch)
       - report write pointer for a full zone as zone start + zone len
         (Niklas Cassel)
       - fix use after free when disconnecting a reconnecting ctrl
         (Ruozhu Li)
       - fix a list corruption in nvmet-tcp (Sagi Grimberg)"
      
      * tag 'nvme-5.16-2021-12-10' of git://git.infradead.org/nvme:
        nvmet-tcp: fix possible list corruption for unexpected command failure
        nvme: fix use after free when disconnecting a reconnecting ctrl
        nvme-multipath: set ana_log_size to 0 after free ana_log_buf
        nvme: report write pointer for a full zone as zone start + zone len
        nvme: disable namespace access for unsupported metadata
        nvme: show subsys nqn for duplicate cntlids
  5. Dec 08, 2021
  6. Dec 07, 2021
  7. Dec 06, 2021
    • nvme: report write pointer for a full zone as zone start + zone len · 793fcab8
      Niklas Cassel authored
      
      
      The write pointer in NVMe ZNS is invalid for a zone in zone state full.
      The same also holds true for ZAC/ZBC.
      
      The current behavior for NVMe is to simply propagate the wp reported by
      the drive, even for full zones. Since the wp is invalid for a full zone,
      the wp reported by the drive may be any value.
      
      The way that the sd_zbc driver handles a full zone is to always report
      the wp as zone start + zone len, regardless of what the drive reported.
      null_blk also follows this convention.
      
      Do the same for NVMe, so that a BLKREPORTZONE ioctl reports the write
      pointer for a full zone in a consistent way, regardless of the interface
      of the underlying zoned block device.
      
      blkzone report before patch:
      start: 0x000040000, len 0x040000, cap 0x03e000, wptr 0xfffffffffffbfff8
      reset:0 non-seq:0, zcond:14(fu) [type: 2(SEQ_WRITE_REQUIRED)]
      
      blkzone report after patch:
      start: 0x000040000, len 0x040000, cap 0x03e000, wptr 0x040000
      reset:0 non-seq:0, zcond:14(fu) [type: 2(SEQ_WRITE_REQUIRED)]
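
The convention described above (ignore the drive's write pointer for a full zone and report zone start + zone len instead) can be sketched as follows. This is an illustration under stated assumptions, not the NVMe driver code: `struct zone_report`, `enum zone_cond`, and `reported_wp` are hypothetical names, and the value 14 for the full condition is taken from the `zcond:14(fu)` field in the reports above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical zone conditions; 14 matches "zcond:14(fu)" in the report. */
enum zone_cond {
    ZONE_COND_EMPTY = 1,
    ZONE_COND_FULL  = 14,
};

/* Hypothetical, simplified view of one zone descriptor. */
struct zone_report {
    uint64_t start;       /* first sector of the zone */
    uint64_t len;         /* zone length in sectors */
    uint64_t wp;          /* write pointer as reported by the drive */
    enum zone_cond cond;  /* zone condition */
};

/* Return the write pointer to expose to callers. */
static uint64_t reported_wp(const struct zone_report *z)
{
    assert(z != NULL);

    /* The drive's wp is invalid for a full zone, so do not propagate it. */
    if (z->cond == ZONE_COND_FULL)
        return z->start + z->len;

    return z->wp;
}
```

For a full zone with start 0x40000 and len 0x40000, this sketch returns 0x80000 regardless of the value the drive reported.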
      
      Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: disable namespace access for unsupported metadata · d39ad2a4
      Keith Busch authored
      
      
      The only fabrics target that supports metadata handling through the
      separate integrity buffer is RDMA. It is currently usable only if the
      size is 8B per block and formatted for protection information. If an
      RDMA target were to export a namespace with a different format (e.g.,
      4k+64B), the driver would not be able to submit valid read/write commands
      for that namespace.
      
      Suppress setting the metadata feature in the namespace so that the
      gendisk capacity will be set to 0. This will prevent read/write access
      through the block stack, but will continue to allow ioctl passthrough
      commands.
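
The rule in this commit message can be modeled with a small sketch. It is a simplified illustration, not the driver's logic: `struct fabrics_ns`, `metadata_usable`, and `block_capacity` are hypothetical names, and the model covers only a fabrics namespace as described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a namespace exported by a fabrics target. */
struct fabrics_ns {
    bool     has_integrity_buf; /* transport handles a separate integrity buffer (RDMA) */
    uint16_t metadata_size;     /* metadata bytes per block, 0 if none */
    bool     pi_formatted;      /* formatted for protection information */
    uint64_t nr_blocks;         /* namespace size in blocks */
};

/* Metadata is usable only as 8 bytes of protection information over RDMA. */
static bool metadata_usable(const struct fabrics_ns *ns)
{
    if (ns->metadata_size == 0)
        return true; /* nothing to handle */

    return ns->has_integrity_buf && ns->metadata_size == 8 && ns->pi_formatted;
}

/*
 * Capacity exposed to the block stack. A capacity of 0 blocks read/write
 * access, while passthrough commands are outside this sketch.
 */
static uint64_t block_capacity(const struct fabrics_ns *ns)
{
    assert(ns != NULL);
    return metadata_usable(ns) ? ns->nr_blocks : 0;
}
```

In this model, a namespace formatted as 4k+64B yields a capacity of 0, while one with 8B of protection information per block keeps its full size.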
      
      Cc: Max Gurtovoy <mgurtovoy@nvidia.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme: show subsys nqn for duplicate cntlids · 16cc33b2
      Keith Busch authored
      The driver-assigned nvme handle isn't persistent across reboots, so it is
      not enough information to match up where the collisions are occurring.
      Add the subsys nqn string to the output so that it can more easily be
      identified later.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=215099
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  8. Nov 29, 2021
  9. Nov 27, 2021
  10. Nov 26, 2021
  11. Nov 25, 2021
    • Merge tag 'nvme-5.16-2021-11-25' of git://git.infradead.org/nvme into block-5.16 · 3fd40fa2
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "nvme fixes for Linux 5.16
      
       - add a NO APST quirk for a Kioxia device (Enzo Matsumiya)
       - fix write zeroes pi (Klaus Jensen)
       - various TCP transport fixes (Maurizio Lombardi and Varun Prakash)
       - ignore invalid fast_io_fail_tmo values (Maurizio Lombardi)
       - use IOCB_NOWAIT only if the filesystem supports it (Maurizio Lombardi)"
      
      * tag 'nvme-5.16-2021-11-25' of git://git.infradead.org/nvme:
        nvmet: use IOCB_NOWAIT only if the filesystem supports it
        nvme: fix write zeroes pi
        nvme-fabrics: ignore invalid fast_io_fail_tmo values
        nvme-pci: add NO APST quirk for Kioxia device
        nvme-tcp: fix memory leak when freeing a queue
        nvme-tcp: validate R2T PDU in nvme_tcp_handle_r2t()
        nvmet-tcp: fix incomplete data digest send
        nvmet-tcp: fix memory leak when performing a controller reset
        nvmet-tcp: add an helper to free the cmd buffers
        nvmet-tcp: fix a race condition between release_queue and io_work
    • nvmet: use IOCB_NOWAIT only if the filesystem supports it · c024b226
      Maurizio Lombardi authored
      
      
      Submit I/O requests with the IOCB_NOWAIT flag set only if
      the underlying filesystem supports it.
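
The idea can be sketched as a flag filter applied before submission. This is a toy model under stated assumptions, not the nvmet code: the `SKETCH_` flag bits, `struct sketch_file`, and `submit_flags` are hypothetical and do not reflect kernel values or structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch; not the kernel's values. */
#define SKETCH_IOCB_DIRECT (1u << 0)
#define SKETCH_IOCB_NOWAIT (1u << 1)

/* Hypothetical stand-in for a file backed by some filesystem. */
struct sketch_file {
    bool supports_nowait; /* whether the filesystem can honor non-blocking I/O */
};

/* Drop the NOWAIT request when the backing filesystem cannot honor it. */
static unsigned int submit_flags(const struct sketch_file *f, unsigned int wanted)
{
    assert(f != NULL);

    if (!f->supports_nowait)
        wanted &= ~SKETCH_IOCB_NOWAIT;

    return wanted;
}
```

All other requested flags pass through unchanged; only the non-blocking hint is conditional on filesystem support.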
      
      Fixes: 50a909db ("nvmet: use IOCB_NOWAIT for file-ns buffered I/O")
      Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  12. Nov 24, 2021
  13. Nov 23, 2021
  14. Nov 19, 2021
  15. Nov 17, 2021
  16. Nov 16, 2021
    • blk-mq: cancel blk-mq dispatch work in both blk_cleanup_queue and disk_release() · 2a19b28f
      Ming Lei authored
      
      
      To avoid slowing down queue destruction, we don't call
      blk_mq_quiesce_queue() in blk_cleanup_queue(); instead, we delay
      canceling dispatch work until blk_release_queue().
      
      However, this approach has caused a kernel oops [1], reported by Changhui. The log
      shows that scsi_device can be freed before blk_release_queue() runs,
      which is expected, since scsi_device is released after the scsi disk
      is closed and the scsi_device is removed.
      
      Fix the issue by canceling blk-mq dispatch work in both blk_cleanup_queue()
      and disk_release():
      
      1) When disk_release() runs, the disk has been closed and all sync
      dispatch activity has finished, so canceling dispatch work is enough to
      quiesce filesystem I/O dispatch activity.
      
      2) In blk_cleanup_queue(), only passthrough requests matter, and a
      passthrough request is always explicitly allocated and freed by
      its caller. Once the queue is frozen, all sync dispatch activity
      for passthrough requests has finished, so canceling dispatch work
      is enough to prevent any further dispatch activity.
      
      [1] kernel panic log
      [12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300
      [12622.777186] #PF: supervisor read access in kernel mode
      [12622.782918] #PF: error_code(0x0000) - not-present page
      [12622.788649] PGD 0 P4D 0
      [12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI
      [12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1
      [12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
      [12622.813321] Workqueue: kblockd blk_mq_run_work_fn
      [12622.818572] RIP: 0010:sbitmap_get+0x75/0x190
      [12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 <48> 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58
      [12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202
      [12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004
      [12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030
      [12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334
      [12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
      [12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030
      [12622.889926] FS:  0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000
      [12622.898956] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0
      [12622.913328] Call Trace:
      [12622.916055]  <TASK>
      [12622.918394]  scsi_mq_get_budget+0x1a/0x110
      [12622.922969]  __blk_mq_do_dispatch_sched+0x1d4/0x320
      [12622.928404]  ? pick_next_task_fair+0x39/0x390
      [12622.933268]  __blk_mq_sched_dispatch_requests+0xf4/0x140
      [12622.939194]  blk_mq_sched_dispatch_requests+0x30/0x60
      [12622.944829]  __blk_mq_run_hw_queue+0x30/0xa0
      [12622.949593]  process_one_work+0x1e8/0x3c0
      [12622.954059]  worker_thread+0x50/0x3b0
      [12622.958144]  ? rescuer_thread+0x370/0x370
      [12622.962616]  kthread+0x158/0x180
      [12622.966218]  ? set_kthread_struct+0x40/0x40
      [12622.970884]  ret_from_fork+0x22/0x30
      [12622.974875]  </TASK>
      [12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]
      
      Reported-by: ChanghuiZhong <czhong@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: linux-scsi@vger.kernel.org
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: fix missing queue put in error path · 95febeb6
      Jens Axboe authored
      If we fail the submission queue checks, we don't put the queue afterwards.
      This can cause various issues like stalls on scheduler switch or failure
      to remove the device, or like in the original bug report, timeout waiting
      for the device on reboot/restart.
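
The class of bug described above (a reference taken on entry and not dropped on an early-exit path) can be shown with a generic sketch. It is not the block layer code: `struct sketch_queue`, `queue_get`, `queue_put`, and `submit` are hypothetical names, and the single `accepts_io` check stands in for the submission checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical queue with a plain reference count. */
struct sketch_queue {
    int  refs;       /* outstanding references */
    bool accepts_io; /* stand-in for the submission checks */
};

static void queue_get(struct sketch_queue *q)
{
    q->refs++;
}

static void queue_put(struct sketch_queue *q)
{
    assert(q->refs > 0);
    q->refs--;
}

/*
 * Returns 0 on success, with the reference still held for the in-flight
 * request, or -1 on failure, with the reference dropped again.
 */
static int submit(struct sketch_queue *q)
{
    assert(q != NULL);
    queue_get(q);

    if (!q->accepts_io) {
        /* Error path: without this put, the count never returns to zero. */
        queue_put(q);
        return -1;
    }

    return 0;
}
```

If the put on the error path is omitted, anything that waits for the count to reach zero can wait indefinitely, which is consistent with the stalls and removal failures listed in the commit message.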
      
      While in there, fix a few whitespace discrepancies in the surrounding
      code.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=215039
      Fixes: b637108a ("blk-mq: fix filesystem I/O request allocation")
      Reported-and-tested-by: Stephen Smith <stephenmsmith@blueyonder.co.uk>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Check ADMIN before NICE for IOPRIO_CLASS_RT · 94c4b4fd
      Alistair Delva authored
      
      
      Booting to Android userspace on 5.14 or newer triggers the following
      SELinux denial:
      
      avc: denied { sys_nice } for comm="init" capability=23
           scontext=u:r:init:s0 tcontext=u:r:init:s0 tclass=capability
           permissive=0
      
      Init is PID 1 running as root, so it already has CAP_SYS_ADMIN. For
      better compatibility with older SEPolicy, check ADMIN before NICE.
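
Why the order of the two checks matters can be seen in a small model where every failed capability check leaves a denial record. This is a sketch under stated assumptions, not the kernel's ioprio code: all `sketch_` names are hypothetical, a set bit means "held and permitted by policy", and the numeric values follow the `capability=23` field in the denial above for NICE and the usual Linux numbering for ADMIN.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_CAP_SYS_ADMIN 21
#define SKETCH_CAP_SYS_NICE  23 /* matches capability=23 in the denial */

/* Hypothetical task: permitted capabilities plus a count of logged denials. */
struct sketch_task {
    unsigned long caps;
    int denials_logged;
};

/* Each failed check is recorded, standing in for an audit message. */
static bool sketch_capable(struct sketch_task *t, int cap)
{
    assert(cap >= 0 && cap < 32);

    if (t->caps & (1UL << cap))
        return true;

    t->denials_logged++;
    return false;
}

/* ADMIN first: || short-circuits, so NICE is never probed when ADMIN passes. */
static bool may_set_rt_admin_first(struct sketch_task *t)
{
    return sketch_capable(t, SKETCH_CAP_SYS_ADMIN) ||
           sketch_capable(t, SKETCH_CAP_SYS_NICE);
}

/* NICE first: the same result, but a denial is logged before ADMIN passes. */
static bool may_set_rt_nice_first(struct sketch_task *t)
{
    return sketch_capable(t, SKETCH_CAP_SYS_NICE) ||
           sketch_capable(t, SKETCH_CAP_SYS_ADMIN);
}
```

Both orders grant the request for a task that passes the ADMIN check; only the NICE-first order produces a denial record along the way.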
      
      Fixes: 9d3a39a5 ("block: grant IOPRIO_CLASS_RT to CAP_SYS_NICE")
      Signed-off-by: Alistair Delva <adelva@google.com>
      Cc: Khazhismel Kumykov <khazhy@google.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Paul Moore <paul@paul-moore.com>
      Cc: selinux@vger.kernel.org
      Cc: linux-security-module@vger.kernel.org
      Cc: kernel-team@android.com
      Cc: stable@vger.kernel.org # v5.14+
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Link: https://lore.kernel.org/r/20211115181655.3608659-1-adelva@google.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  17. Nov 15, 2021