  1. Nov 23, 2023
    • nvme: target: fix Kconfig select statements · 65e2a74c
      Arnd Bergmann authored
      
      
      When the NVME target code is built-in but its TCP frontend is a loadable
      module, enabling keyring support causes a link failure:
      
      x86_64-linux-ld: vmlinux.o: in function `nvmet_ports_make':
      configfs.c:(.text+0x100a211): undefined reference to `nvme_keyring_id'
      
      The problem is that CONFIG_NVME_TARGET_TCP_TLS is a 'bool' symbol that
      depends on the tristate CONFIG_NVME_TARGET_TCP, so any 'select' from
      it inherits the state of the tristate symbol rather than the intended
      CONFIG_NVME_TARGET one that contains the actual call.
      
      The same thing is true for CONFIG_KEYS, which itself is required for
      NVME_KEYRING.
      
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Link: https://lore.kernel.org/r/20231122224719.4042108-3-arnd@kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      65e2a74c
    • nvme: target: fix nvme_keyring_id() references · d78abcba
      Arnd Bergmann authored
      
      
      In configurations without CONFIG_NVME_TARGET_TCP_TLS, the keyring
      code might not be available, or using it will result in a runtime
      failure:
      
      x86_64-linux-ld: vmlinux.o: in function `nvmet_ports_make':
      configfs.c:(.text+0x100a211): undefined reference to `nvme_keyring_id'
      
      Add a check to ensure we only check the keyring if there is a chance
      of it being used, which avoids both the runtime and link-time
      problems.
      
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Link: https://lore.kernel.org/r/20231122224719.4042108-2-arnd@kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d78abcba
    • Merge tag 'nvme-6.7-2023-11-22' of git://git.infradead.org/nvme into block-6.7 · 55072cd7
      Jens Axboe authored
      Pull NVMe fixes from Keith:
      
      "nvme fixes for Linux 6.7
      
       - TCP TLS fixes (Hannes)
       - Authentication fixes (Mark, Hannes)
       - Properly terminate target names (Christoph)"
      
      * tag 'nvme-6.7-2023-11-22' of git://git.infradead.org/nvme:
        nvme: move nvme_stop_keep_alive() back to original position
        nvmet-tcp: always initialize tls_handshake_tmo_work
        nvmet: nul-terminate the NQNs passed in the connect command
        nvme: blank out authentication fabrics options if not configured
        nvme: catch errors from nvme_configure_metadata()
        nvme-tcp: only evaluate 'tls' option if TLS is selected
        nvme-auth: set explanation code for failure2 msgs
        nvme-auth: unlock mutex in one place only
      55072cd7
    • nvme: move nvme_stop_keep_alive() back to original position · 3af755a4
      Hannes Reinecke authored
      Stopping keep-alive not only stops the keep-alive workqueue,
      but also needs to be synchronized with I/O termination as we
      must not send a keep-alive command after all I/O has been
      terminated.
      So to avoid any regressions move the call to stop_keep_alive()
      back to its original position and ensure that keep-alive is
      correctly stopped when failing to set up the admin queue.
      
      Fixes: 4733b65d ("nvme: start keep-alive after admin queue setup")
      Suggested-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      3af755a4
  2. Nov 21, 2023
    • nbd: pass nbd_sock to nbd_read_reply() instead of index · 98c598af
      Li Nan authored
      
      
      If a socket is processing the 'NBD_SET_SOCK' ioctl, config->socks may be
      reallocated by krealloc() in nbd_add_socket(). If a garbage request is
      received at that moment, a UAF may occur.
      
        T1
        nbd_ioctl
         __nbd_ioctl
          nbd_add_socket
           blk_mq_freeze_queue
      				T2
        				recv_work
        				 nbd_read_reply
        				  sock_xmit
           krealloc config->socks
      				   deref config->socks
      
      Pass nbd_sock to nbd_read_reply(), and introduce a new function
      sock_xmit_recv(), which differs from sock_xmit() only in how it gets
      the socket.
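A minimal userspace sketch of why handing the reader the socket object itself is safer than re-reading the array by index. It assumes, as the commit message implies, that the array holds pointers to separately allocated socket objects. `struct sock_entry`, `struct sock_table`, `table_add()` and `entry_fd()` are hypothetical names, and `realloc()` stands in for `krealloc()`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct nbd_sock. */
struct sock_entry {
    int fd;
};

/* Hypothetical stand-in for the part of the config that owns the array. */
struct sock_table {
    struct sock_entry **socks; /* may move whenever the table grows */
    int num;
};

/* Grow the pointer array by one slot and append a new entry.
 * Returns the new entry, or NULL on allocation failure. */
struct sock_entry *table_add(struct sock_table *t, int fd)
{
    struct sock_entry **grown;
    struct sock_entry *e = malloc(sizeof(*e));

    if (!e)
        return NULL;
    grown = realloc(t->socks, (t->num + 1) * sizeof(*grown));
    if (!grown) {
        free(e);
        return NULL;
    }
    e->fd = fd;
    grown[t->num] = e;
    t->socks = grown; /* the old array address must not be used again */
    t->num++;
    return e;
}

/* A reader that was handed the entry itself never touches t->socks,
 * so growing the table cannot invalidate what it dereferences. */
int entry_fd(const struct sock_entry *e)
{
    assert(e != NULL);
    return e->fd;
}
```

The entry pointer stays valid across any number of `table_add()` calls, whereas a cached copy of `t->socks` may dangle after the first reallocation.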
      
      ==================================================================
      BUG: KASAN: use-after-free in sock_xmit+0x525/0x550
      Read of size 8 at addr ffff8880188ec428 by task kworker/u12:1/18779
      
      Workqueue: knbd4-recv recv_work
      Call Trace:
       __dump_stack
       dump_stack+0xbe/0xfd
       print_address_description.constprop.0+0x19/0x170
       __kasan_report.cold+0x6c/0x84
       kasan_report+0x3a/0x50
       sock_xmit+0x525/0x550
       nbd_read_reply+0xfe/0x2c0
       recv_work+0x1c2/0x750
       process_one_work+0x6b6/0xf10
       worker_thread+0xdd/0xd80
       kthread+0x30a/0x410
       ret_from_fork+0x22/0x30
      
      Allocated by task 18784:
       kasan_save_stack+0x1b/0x40
       kasan_set_track
       set_alloc_info
       __kasan_kmalloc
       __kasan_kmalloc.constprop.0+0xf0/0x130
       slab_post_alloc_hook
       slab_alloc_node
       slab_alloc
       __kmalloc_track_caller+0x157/0x550
       __do_krealloc
       krealloc+0x37/0xb0
       nbd_add_socket+0x2d3/0x880
       __nbd_ioctl
       nbd_ioctl+0x584/0x8e0
       __blkdev_driver_ioctl
       blkdev_ioctl+0x2a0/0x6e0
       block_ioctl+0xee/0x130
       vfs_ioctl
       __do_sys_ioctl
       __se_sys_ioctl+0x138/0x190
       do_syscall_64+0x33/0x40
       entry_SYSCALL_64_after_hwframe+0x61/0xc6
      
      Freed by task 18784:
       kasan_save_stack+0x1b/0x40
       kasan_set_track+0x1c/0x30
       kasan_set_free_info+0x20/0x40
       __kasan_slab_free.part.0+0x13f/0x1b0
       slab_free_hook
       slab_free_freelist_hook
       slab_free
       kfree+0xcb/0x6c0
       krealloc+0x56/0xb0
       nbd_add_socket+0x2d3/0x880
       __nbd_ioctl
       nbd_ioctl+0x584/0x8e0
       __blkdev_driver_ioctl
       blkdev_ioctl+0x2a0/0x6e0
       block_ioctl+0xee/0x130
       vfs_ioctl
       __do_sys_ioctl
       __se_sys_ioctl+0x138/0x190
       do_syscall_64+0x33/0x40
       entry_SYSCALL_64_after_hwframe+0x61/0xc6
      
      Signed-off-by: Li Nan <linan122@huawei.com>
      Reviewed-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20230911023308.3467802-1-linan666@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      98c598af
    • s390/dasd: protect device queue against concurrent access · db46cd1e
      Jan Höppner authored
      In dasd_profile_start() the number of requests on the device queue is
      counted. The access to the device queue is unprotected against
      concurrent access. With a lot of parallel I/O, especially with alias
      devices enabled, the device queue can change while dasd_profile_start()
      is accessing the queue. In the worst case this leads to a kernel panic
      due to incorrect pointer accesses.
      
      Fix this by taking the device lock before accessing the queue and
      counting the requests. Additionally the check for a valid profile data
      pointer can be done earlier to avoid unnecessary locking in a hot path.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 4fa52aa7 ("[S390] dasd: add enhanced DASD statistics interface")
      Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
      Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
      Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
      Link: https://lore.kernel.org/r/20231025132437.1223363-3-sth@linux.ibm.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      db46cd1e
    • s390/dasd: resolve spelling mistake · 5029c5e4
      Muhammad Muzammil authored
      
      
      Resolve a typing mistake: change "pimary" to "primary".
      
      Signed-off-by: Muhammad Muzammil <m.muzzammilashraf@gmail.com>
      Link: https://lore.kernel.org/r/20231010043140.28416-1-m.muzzammilashraf@gmail.com
      Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
      Link: https://lore.kernel.org/r/20231025132437.1223363-2-sth@linux.ibm.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5029c5e4
    • block/null_blk: Fix double blk_mq_start_request() warning · 53f2bca2
      Chengming Zhou authored
      
      
      When CONFIG_BLK_DEV_NULL_BLK_FAULT_INJECTION is enabled, null_queue_rq()
      would return BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE for the request,
      which has been marked as MQ_RQ_IN_FLIGHT by blk_mq_start_request().
      
      Then null_queue_rqs() puts these requests in the rqlist and returns to
      the block layer core, which tries to queue them individually again,
      so the warning in blk_mq_start_request() is triggered.
      
      Fix it by splitting the null_queue_rq() into two parts: the first is the
      preparation of request, the second is the handling of request. We put
      the blk_mq_start_request() after the preparation part, which may fail
      and return back to the block layer core.
      
      The throttling also belongs to the preparation part, so move it before
      blk_mq_start_request(). Also change the return type of null_handle_cmd()
      to void, since it now always returns BLK_STS_OK.
      
      Reported-by: <syzbot+fcc47ba2476570cbbeb0@syzkaller.appspotmail.com>
      Closes: https://lore.kernel.org/all/0000000000000e6aac06098aee0c@google.com/
      Fixes: d78bfa13 ("block/null_blk: add queue_rqs() support")
      Suggested-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Link: https://lore.kernel.org/r/20231120032521.1012037-1-chengming.zhou@linux.dev
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      53f2bca2
    • nvmet-tcp: always initialize tls_handshake_tmo_work · 11b9d0b4
      Hannes Reinecke authored
      The TLS handshake timeout work item should always be
      initialized to avoid a crash when cancelling the workqueue.
      
      Fixes: 675b453e ("nvmet-tcp: enable TLS handshake upcall")
      Suggested-by: Maurizio Lombardi <mlombard@redhat.com>
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Tested-by: Yi Zhang <yi.zhang@redhat.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      11b9d0b4
    • nvmet: nul-terminate the NQNs passed in the connect command · 1c22e029
      Christoph Hellwig authored
      The host and subsystem NQNs are passed in the connect command payload and
      interpreted as nul-terminated strings.  Ensure they actually are
      nul-terminated before using them.
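A minimal userspace sketch of the idea: force the last byte of each fixed-size name field to NUL before treating it as a C string. The field size `NQN_FIELD_LEN`, the `struct connect_payload` layout and `terminate_nqns()` are assumptions made for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <string.h>

#define NQN_FIELD_LEN 256 /* assumed field size for this sketch */

/* Hypothetical payload: fixed-size name fields filled in by an untrusted peer. */
struct connect_payload {
    char subsysnqn[NQN_FIELD_LEN];
    char hostnqn[NQN_FIELD_LEN];
};

/* Overwrite the final byte of each field with NUL so that later
 * strlen()/strcmp() calls cannot run past the end of the field. */
void terminate_nqns(struct connect_payload *p)
{
    assert(p != NULL);
    p->subsysnqn[NQN_FIELD_LEN - 1] = '\0';
    p->hostnqn[NQN_FIELD_LEN - 1] = '\0';
}
```

A peer that fills a field completely with non-NUL bytes then yields a string of at most `NQN_FIELD_LEN - 1` characters instead of an unterminated buffer.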
      
      Fixes: a07b4970 ("nvmet: add a generic NVMe target")
      Reported-by: Alon Zahavi <zahavi.alon@gmail.com>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      1c22e029
    • nvme: blank out authentication fabrics options if not configured · c7ca9757
      Hannes Reinecke authored
      If the config option NVME_HOST_AUTH is not selected we should not
      accept the corresponding fabrics options. This allows userspace
      to detect if NVMe authentication has been enabled for the kernel.
      
      Cc: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Fixes: f50fff73 ("nvme: implement In-Band authentication")
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Reviewed-by: Daniel Wagner <dwagner@suse.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      c7ca9757
    • nvme: catch errors from nvme_configure_metadata() · cd9aed60
      Hannes Reinecke authored
      
      
      nvme_configure_metadata() is issuing I/O, so we might incur an I/O
      error which will cause the connection to be reset.
      But in that case any further probing will race with reset and
      cause UAF errors.
      So return a status from nvme_configure_metadata() and abort
      probing if there was an I/O error.
      
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      cd9aed60
    • nvme-tcp: only evaluate 'tls' option if TLS is selected · 23441536
      Hannes Reinecke authored
      We only need to evaluate the 'tls' connect option if TLS is
      enabled; otherwise we might be getting a link error.
      
      Fixes: 706add13 ("nvme: keyring: fix conditional compilation")
      Reported-by: kernel test robot <yujie.liu@intel.com>
      Closes: https://lore.kernel.org/r/202311140426.0eHrTXBr-lkp@intel.com/
      Signed-off-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      23441536
    • nvme-auth: set explanation code for failure2 msgs · 38ce1570
      Mark O'Donovan authored
      
      
      Some error cases were not setting an auth-failure-reason-code-explanation.
      This means an AUTH_Failure2 message will be sent with an explanation value
      of 0 which is a reserved value.
      
      Signed-off-by: Mark O'Donovan <shiftee@posteo.net>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      38ce1570
    • Mark O'Donovan's avatar
      616add70
    • block: Remove blk_set_runtime_active() · c96b8175
      Damien Le Moal authored
      
      
      The function blk_set_runtime_active() is called only from
      blk_post_runtime_resume(), so there is no need for that function to be
      exported. Open-code this function directly in blk_post_runtime_resume()
      and remove it.
      
      Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Link: https://lore.kernel.org/r/20231120070611.33951-1-dlemoal@kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c96b8175
    • nbd: fix null-ptr-dereference while accessing 'nbd->config' · c2da049f
      
      
      Memory reordering may occur in nbd_genl_connect(), causing config_refs
      to be set to 1 while nbd->config is still empty. Opening nbd at this
      time will cause null-ptr-dereference.
      
         T1                      T2
         nbd_open
          nbd_get_config_unlocked
                       	   nbd_genl_connect
                       	    nbd_alloc_and_init_config
                       	     //memory reordered
                        	     refcount_set(&nbd->config_refs, 1)  // 2
           nbd->config
            ->null point
      			     nbd->config = config  // 1
      
      Fix it by adding smp barrier to guarantee the execution sequence.
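A minimal userspace sketch of the ordering requirement, using C11 release/acquire atomics as a stand-in for the kernel's SMP barriers and refcount helpers. `struct device`, `device_publish()` and `device_get_config()` are hypothetical names; this is an illustration of the pattern, not the nbd code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct config {
    int flags;
};

struct device {
    struct config *config;  /* plain pointer, guarded by config_refs */
    atomic_int config_refs; /* 0 means "no config published yet" */
};

/* Writer: store the pointer first, then publish the refcount with
 * release ordering so the pointer store cannot be reordered after it. */
void device_publish(struct device *d, struct config *c)
{
    assert(c != NULL);
    d->config = c;
    atomic_store_explicit(&d->config_refs, 1, memory_order_release);
}

/* Reader: take a reference only if the count is non-zero. The acquire
 * ordering on success guarantees the pointer store above is visible. */
struct config *device_get_config(struct device *d)
{
    int refs = atomic_load_explicit(&d->config_refs, memory_order_relaxed);

    do {
        if (refs == 0)
            return NULL; /* nothing published yet */
    } while (!atomic_compare_exchange_weak_explicit(&d->config_refs,
                                                    &refs, refs + 1,
                                                    memory_order_acquire,
                                                    memory_order_relaxed));
    return d->config;
}
```

Without the release/acquire pairing, a reader could observe a non-zero count and still load a NULL `config`, which is the interleaving shown in the diagram above.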
      
      Signed-off-by: Li Nan <linan122@huawei.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Link: https://lore.kernel.org/r/20231116162316.1740402-4-linan666@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c2da049f
    • nbd: factor out a helper to get nbd_config without holding 'config_lock' · 3123ac77
      Li Nan authored
      
      
      There are no functional changes; this just makes the code cleaner and
      prepares for fixing the null-ptr-dereference while accessing 'nbd->config'.
      
      Signed-off-by: Li Nan <linan122@huawei.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Link: https://lore.kernel.org/r/20231116162316.1740402-3-linan666@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3123ac77
    • nbd: fold nbd config initialization into nbd_alloc_config() · 1b598605
      Li Nan authored
      
      
      There are no functional changes; this makes the code cleaner and prepares
      for fixing the null-ptr-dereference while accessing 'nbd->config'.
      
      Signed-off-by: Li Nan <linan122@huawei.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Link: https://lore.kernel.org/r/20231116162316.1740402-2-linan666@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1b598605
    • Merge tag 'md-fixes-20231120' of... · 8a554c62
      Jens Axboe authored
      Merge tag 'md-fixes-20231120' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-6.7
      
      Pull MD fix from Song.
      
      * tag 'md-fixes-20231120' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md: fix bi_status reporting in md_end_clone_io
      8a554c62
    • bcache: avoid NULL checking to c->root in run_cache_set() · 3eba5e0b
      Coly Li authored
      
      
      In run_cache_set(), after c->root is returned from bch_btree_node_get(), it
      is checked by IS_ERR_OR_NULL(). It is unnecessary to check for NULL
      because bch_btree_node_get() never returns a NULL pointer to its caller.
      
      This patch replaces IS_ERR_OR_NULL() by IS_ERR() for the above reason.
      
      Signed-off-by: Coly Li <colyli@suse.de>
      Link: https://lore.kernel.org/r/20231120052503.6122-11-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3eba5e0b
    • bcache: add code comments for bch_btree_node_get() and __bch_btree_node_alloc() · 31f5b956
      Coly Li authored
      
      
      This patch adds code comments to bch_btree_node_get() and
      __bch_btree_node_alloc() that NULL pointer will not be returned and it
      is unnecessary to check NULL pointer by the callers of these routines.
      
      Signed-off-by: Coly Li <colyli@suse.de>
      Link: https://lore.kernel.org/r/20231120052503.6122-10-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      31f5b956
    • bcache: replace a mistaken IS_ERR() by IS_ERR_OR_NULL() in btree_gc_coalesce() · f72f4312
      Coly Li authored
      Commit 028ddcac ("bcache: Remove unnecessary NULL point check in
      node allocations") made the following change inside btree_gc_coalesce():
      
      31 @@ -1340,7 +1340,7 @@ static int btree_gc_coalesce(
      32         memset(new_nodes, 0, sizeof(new_nodes));
      33         closure_init_stack(&cl);
      34
      35 -       while (nodes < GC_MERGE_NODES && !IS_ERR_OR_NULL(r[nodes].b))
      36 +       while (nodes < GC_MERGE_NODES && !IS_ERR(r[nodes].b))
      37                 keys += r[nodes++].keys;
      38
      39         blocks = btree_default_blocks(b->c) * 2 / 3;
      
      At line 35 the original r[nodes].b is not always allocated by
      __bch_btree_node_alloc(), and may be initialized as a NULL pointer by the
      caller of btree_gc_coalesce(). Therefore the change at line 36 is not
      correct.
      
      This patch replaces the mistaken IS_ERR() by IS_ERR_OR_NULL() to avoid
      potential issues.
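A minimal userspace sketch of the difference between the two checks. The helpers below re-create the error-pointer convention (the top 4095 address values encode negative errno codes) under lower-case, hypothetical names; `count_usable()` is an illustrative stand-in for the loop in the diff, not the bcache code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* True only for pointers in the reserved error range; NULL is NOT an error. */
bool is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* True for NULL as well as for encoded errors. */
bool is_err_or_null(const void *p)
{
    return p == NULL || is_err(p);
}

/* Count leading usable slots, stopping at the first NULL or error pointer.
 * With is_err() alone, a NULL slot would be counted and later dereferenced. */
size_t count_usable(void *const slots[], size_t n)
{
    size_t i = 0;

    assert(slots != NULL);
    while (i < n && !is_err_or_null(slots[i]))
        i++;
    return i;
}
```

Because `is_err(NULL)` is false, a loop guarded only by `!is_err()` walks straight over slots that the caller initialized to NULL.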
      
      Fixes: 028ddcac ("bcache: Remove unnecessary NULL point check in node allocations")
      Cc: <stable@vger.kernel.org> # 6.5+
      Cc: Zheng Wang <zyytlz.wz@163.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Link: https://lore.kernel.org/r/20231120052503.6122-9-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f72f4312
    • bcache: fixup multi-threaded bch_sectors_dirty_init() wake-up race · 2faac25d
      We get a kernel crash about "unable to handle kernel paging request":
      
      ```dmesg
      [368033.032005] BUG: unable to handle kernel paging request at ffffffffad9ae4b5
      [368033.032007] PGD fc3a0d067 P4D fc3a0d067 PUD fc3a0e063 PMD 8000000fc38000e1
      [368033.032012] Oops: 0003 [#1] SMP PTI
      [368033.032015] CPU: 23 PID: 55090 Comm: bch_dirtcnt[0] Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-147.5.1.es8_24.x86_64 #1
      [368033.032017] Hardware name: Tsinghua Tongfang THTF Chaoqiang Server/072T6D, BIOS 2.4.3 01/17/2017
      [368033.032027] RIP: 0010:native_queued_spin_lock_slowpath+0x183/0x1d0
      [368033.032029] Code: 8b 02 48 85 c0 74 f6 48 89 c1 eb d0 c1 e9 12 83 e0
      03 83 e9 01 48 c1 e0 05 48 63 c9 48 05 c0 3d 02 00 48 03 04 cd 60 68 93
      ad <48> 89 10 8b 42 08 85 c0 75 09 f3 90 8b 42 08 85 c0 74 f7 48 8b 02
      [368033.032031] RSP: 0018:ffffbb48852abe00 EFLAGS: 00010082
      [368033.032032] RAX: ffffffffad9ae4b5 RBX: 0000000000000246 RCX: 0000000000003bf3
      [368033.032033] RDX: ffff97b0ff8e3dc0 RSI: 0000000000600000 RDI: ffffbb4884743c68
      [368033.032034] RBP: 0000000000000001 R08: 0000000000000000 R09: 000007ffffffffff
      [368033.032035] R10: ffffbb486bb01000 R11: 0000000000000001 R12: ffffffffc068da70
      [368033.032036] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
      [368033.032038] FS:  0000000000000000(0000) GS:ffff97b0ff8c0000(0000) knlGS:0000000000000000
      [368033.032039] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [368033.032040] CR2: ffffffffad9ae4b5 CR3: 0000000fc3a0a002 CR4: 00000000003626e0
      [368033.032042] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [368033.032043] bcache: bch_cached_dev_attach() Caching rbd479 as bcache462 on set 8cff3c36-4a76-4242-afaa-7630206bc70b
      [368033.032045] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [368033.032046] Call Trace:
      [368033.032054]  _raw_spin_lock_irqsave+0x32/0x40
      [368033.032061]  __wake_up_common_lock+0x63/0xc0
      [368033.032073]  ? bch_ptr_invalid+0x10/0x10 [bcache]
      [368033.033502]  bch_dirty_init_thread+0x14c/0x160 [bcache]
      [368033.033511]  ? read_dirty_submit+0x60/0x60 [bcache]
      [368033.033516]  kthread+0x112/0x130
      [368033.033520]  ? kthread_flush_work_fn+0x10/0x10
      [368033.034505]  ret_from_fork+0x35/0x40
      ```
      
      The crash occurred when calling wake_up(&state->wait), so we wanted
      to look at the value of state. However, bch_sectors_dirty_init()
      was not found in the stack of any task. Since state is allocated on
      the stack, we guessed that bch_sectors_dirty_init() had exited, causing
      bch_dirty_init_thread() to hit "unable to handle kernel paging request".
      
      To verify this idea, we added a debug print around
      wake_up(&state->wait). "wake up" is printed twice, although we expect
      only the last thread to issue a single wake-up.
      
      ```dmesg
      [  994.641004] alcache: bch_dirty_init_thread() wake up
      [  994.641018] alcache: bch_dirty_init_thread() wake up
      [  994.641523] alcache: bch_sectors_dirty_init() init exit
      ```
      
      There is a race. If bch_sectors_dirty_init() exits after the first
      wake-up, the second wake-up will trigger this bug ("unable to handle
      kernel paging request").
      
      The sequence is as follows:
      
      bch_sectors_dirty_init
          kthread_run ==============> bch_dirty_init_thread(bch_dirtcnt[0])
                  ...                         ...
          atomic_inc(&state.started)          ...
                  ...                         ...
          atomic_read(&state.enough)          ...
                  ...                 atomic_set(&state->enough, 1)
          kthread_run ======================================================> bch_dirty_init_thread(bch_dirtcnt[1])
                  ...                 atomic_dec_and_test(&state->started)            ...
          atomic_inc(&state.started)          ...                                     ...
                  ...                 wake_up(&state->wait)                           ...
          atomic_read(&state.enough)                                          atomic_dec_and_test(&state->started)
                  ...                                                                 ...
          wait_event(state.wait, atomic_read(&state.started) == 0)                    ...
          return                                                                      ...
                                                                              wake_up(&state->wait)
      
      We believe it is very common to wake up twice if there is no dirty data,
      but a crash is an extremely low-probability event. It is hard for us to
      reproduce this issue. We attached and detached continuously for a week,
      with a total of more than one million attaches and only one crash.
      
      Putting atomic_inc(&state.started) before kthread_run() can avoid waking
      up twice.
      
      Fixes: b144e45f ("bcache: make bch_sectors_dirty_init() to be multithreaded")
      Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Coly Li <colyli@suse.de>
      Link: https://lore.kernel.org/r/20231120052503.6122-8-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2faac25d
    • bcache: fixup lock c->root error · e34820f9
      Mingzhe Zou authored
      We hit an I/O hang in which the I/O was waiting for the lock on c->root
      to be released.
      
      crash> cache_set.root -l cache_set.list ffffa03fde4c0050
        root = 0xffff802ef454c800
      crash> btree -o 0xffff802ef454c800 | grep rw_semaphore
        [ffff802ef454c858] struct rw_semaphore lock;
      crash> struct rw_semaphore ffff802ef454c858
      struct rw_semaphore {
        count = {
          counter = -4294967297
        },
        wait_list = {
          next = 0xffff00006786fc28,
          prev = 0xffff00005d0efac8
        },
        wait_lock = {
          raw_lock = {
            {
              val = {
                counter = 0
              },
              {
                locked = 0 '\000',
                pending = 0 '\000'
              },
              {
                locked_pending = 0,
                tail = 0
              }
            }
          }
        },
        osq = {
          tail = {
            counter = 0
          }
        },
        owner = 0xffffa03fdc586603
      }
      
      The "counter = -4294967297" means that lock count is -1 and a write lock
      is being attempted. Then, we found that there is a btree with a counter
      of 1 in btree_cache_freeable.
      
      crash> cache_set -l cache_set.list ffffa03fde4c0050 -o|grep btree_cache
        [ffffa03fde4c1140] struct list_head btree_cache;
        [ffffa03fde4c1150] struct list_head btree_cache_freeable;
        [ffffa03fde4c1160] struct list_head btree_cache_freed;
        [ffffa03fde4c1170] unsigned int btree_cache_used;
        [ffffa03fde4c1178] wait_queue_head_t btree_cache_wait;
        [ffffa03fde4c1190] struct task_struct *btree_cache_alloc_lock;
      crash> list -H ffffa03fde4c1140|wc -l
      973
      crash> list -H ffffa03fde4c1150|wc -l
      1123
      crash> cache_set.btree_cache_used -l cache_set.list ffffa03fde4c0050
        btree_cache_used = 2097
      crash> list -s btree -l btree.list -H ffffa03fde4c1140|grep -E -A2 "^  lock = {" > btree_cache.txt
      crash> list -s btree -l btree.list -H ffffa03fde4c1150|grep -E -A2 "^  lock = {" > btree_cache_freeable.txt
      [root@node-3 127.0.0.1-2023-08-04-16:40:28]# pwd
      /var/crash/127.0.0.1-2023-08-04-16:40:28
      [root@node-3 127.0.0.1-2023-08-04-16:40:28]# cat btree_cache.txt|grep counter|grep -v "counter = 0"
      [root@node-3 127.0.0.1-2023-08-04-16:40:28]# cat btree_cache_freeable.txt|grep counter|grep -v "counter = 0"
            counter = 1
      
      We found that this is a bug in bch_sectors_dirty_init() when locking c->root:
          (1). Thread X has locked c->root(A) write.
          (2). Thread Y failed to lock c->root(A), waiting for the lock(c->root A).
          (3). Thread X bch_btree_set_root() changes c->root from A to B.
          (4). Thread X releases the lock(c->root A).
          (5). Thread Y successfully locks c->root(A).
          (6). Thread Y releases the lock(c->root B).
      
              down_write locked ---(1)----------------------┐
                      |                                     |
                      |   down_read waiting ---(2)----┐     |
                      |           |               ┌-------------┐ ┌-------------┐
              bch_btree_set_root ===(3)========>> | c->root   A | | c->root   B |
                      |           |               └-------------┘ └-------------┘
                  up_write ---(4)---------------------┘     |            |
                                  |                         |            |
                          down_read locked ---(5)-----------┘            |
                                  |                                      |
                              up_read ---(6)-----------------------------┘
      
      Since c->root may change, the correct steps to lock c->root should be
      the same as in bch_root_usage(): compare again after locking.
      
      static unsigned int bch_root_usage(struct cache_set *c)
      {
              unsigned int bytes = 0;
              struct bkey *k;
              struct btree *b;
              struct btree_iter iter;
      
              goto lock_root;
      
              do {
                      rw_unlock(false, b);
      lock_root:
                      b = c->root;
                      rw_lock(false, b, b->level);
              } while (b != c->root);
      
              for_each_key_filter(&b->keys, k, &iter, bch_ptr_bad)
                      bytes += bkey_bytes(k);
      
              rw_unlock(false, b);
      
              return (bytes * 100) / btree_bytes(c);
      }
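The same lock-then-recheck loop can be sketched in userspace. This is a single-threaded toy under stated assumptions: the "lock" is just a reader counter, and a hypothetical hook (`swap_root()`, driven by the demo-only global `next_root`) simulates another thread replacing the root between the pointer read and the lock acquisition. None of these names come from bcache.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int readers; /* toy lock: number of current holders */
};

struct tree {
    struct node *root;
};

void node_lock(struct node *n)
{
    n->readers++;
}

void node_unlock(struct node *n)
{
    assert(n->readers > 0);
    n->readers--;
}

/* Demo only: the node that swap_root() installs as the new root. */
struct node *next_root;

/* Simulates a concurrent root replacement. */
void swap_root(struct tree *t)
{
    t->root = next_root;
}

/* Lock the current root. If the root changed while we were acquiring
 * the lock, drop the stale node and retry with the new root. */
struct node *lock_root(struct tree *t, void (*hook)(struct tree *))
{
    struct node *b;

    for (;;) {
        b = t->root;
        if (hook) {
            hook(t);     /* root may change before we get the lock */
            hook = NULL; /* simulate a single swap only */
        }
        node_lock(b);
        if (b == t->root)
            return b;    /* still the root: lock and pointer agree */
        node_unlock(b);  /* stale node: release it, not the new root */
    }
}
```

The key property is that the node being unlocked is always the node that was locked, which is what steps (5) and (6) in the list above violate.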
      
      Fixes: b144e45f ("bcache: make bch_sectors_dirty_init() to be multithreaded")
      Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Coly Li <colyli@suse.de>
      Link: https://lore.kernel.org/r/20231120052503.6122-7-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e34820f9
    • bcache: fixup init dirty data errors · 7cc47e64
      Mingzhe Zou authored
      We found that after a long run, the dirty_data of the bcache device
      becomes incorrect. This error cannot be eliminated without
      re-registering the device.
      
      We also found that the error accumulates when reattaching after a detach.
      
      In bch_sectors_dirty_init(), all keys with inode <= d->id are counted
      again. This is wrong; we only need to count the keys of the current
      device.
      
      Fixes: b144e45f ("bcache: make bch_sectors_dirty_init() to be multithreaded")
      Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Coly Li <colyli@suse.de>
      Link: https://lore.kernel.org/r/20231120052503.6122-6-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7cc47e64
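      The difference between the two counting rules can be illustrated
      with a minimal user-space sketch. It is not the bcache code: the
      struct demo_key type and both helper functions are hypothetical
      and model only a device id (inode) and a dirty sector count per
      key.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a key: only the owning device
 * id (inode) and the number of dirty sectors are modelled. */
struct demo_key {
	uint32_t inode;
	uint64_t dirty_sectors;
};

/* Rule described as wrong: sums every key with inode <= id, so keys
 * that belong to lower-numbered devices are counted as well. */
static uint64_t count_dirty_le(const struct demo_key *keys, size_t n,
			       uint32_t id)
{
	uint64_t total = 0;

	assert(keys != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (keys[i].inode <= id)
			total += keys[i].dirty_sectors;
	return total;
}

/* Rule described as intended: sums only the keys of the current device. */
static uint64_t count_dirty_eq(const struct demo_key *keys, size_t n,
			       uint32_t id)
{
	uint64_t total = 0;

	assert(keys != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (keys[i].inode == id)
			total += keys[i].dirty_sectors;
	return total;
}
```

      With keys for devices 0, 1 and 2, counting for device 1 under the
      first rule also picks up the sectors of device 0, which is the
      kind of over-count the commit message describes.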
    • Rand Deeb's avatar
      bcache: prevent potential division by zero error · 2c7f497a
      Rand Deeb authored
      
      
      In SHOW(), the variable 'n' is of type 'size_t'. Although 'n' is
      checked against zero before the 'do_div' macro is executed, a
      division by zero is still possible in 64-bit environments.
      
      The problem occurs when 'n' is 64 bits in size, greater than zero, and
      its lower 32 bits are all zeros. In such cases, the conditional check
      passes because 'n' is non-zero, but the 'do_div' macro casts 'n' to
      'uint32_t', effectively truncating it to its lower 32 bits.
      Consequently, the divisor becomes zero.
      
      To fix this potential division by zero error and ensure precise
      division handling, this commit replaces the 'do_div' macro with
      div64_u64(). div64_u64() is designed to work with 64-bit operands,
      guaranteeing that division is performed correctly.
      
      This change enhances the robustness of the code, ensuring that division
      operations yield accurate results in all scenarios, eliminating the
      possibility of division by zero, and improving compatibility across
      different 64-bit environments.
      
      Found by Linux Verification Center (linuxtesting.org) with SVACE.
      
      Signed-off-by: Rand Deeb <rand.sec96@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Coly Li <colyli@suse.de>
      Link: https://lore.kernel.org/r/20231120052503.6122-5-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2c7f497a
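      The truncation described in this commit can be reproduced in plain
      user-space C. The sketch below is an illustration only: it does not
      use the kernel's do_div() or div64_u64(), and the helper names
      truncated_divisor() and div_full_64() are hypothetical. It shows
      that a non-zero 64-bit value whose lower 32 bits are zero turns
      into 0 once it is cast to a 32-bit type, while a full 64-bit
      division keeps the divisor intact.

```c
#include <assert.h>
#include <stdint.h>

/* Models what a helper with a 32-bit divisor sees: only the lower
 * 32 bits of n survive the cast. */
static uint32_t truncated_divisor(uint64_t n)
{
	return (uint32_t)n;
}

/* Full 64-bit by 64-bit division, the behaviour the commit message
 * expects from a 64-bit division helper. The caller must pass a
 * non-zero divisor. */
static uint64_t div_full_64(uint64_t dividend, uint64_t n)
{
	assert(n != 0);
	return dividend / n;
}
```

      For n = 1ULL << 32 the check "n != 0" passes, yet
      truncated_divisor(n) is 0, whereas div_full_64(1ULL << 40, n)
      divides by the full value.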
    • Colin Ian King's avatar
      bcache: remove redundant assignment to variable cur_idx · be93825f
      Colin Ian King authored
      
      
      Variable cur_idx is being initialized with a value that is never read,
      it is being re-assigned later in a while-loop. Remove the redundant
      assignment. Cleans up clang scan build warning:
      
      drivers/md/bcache/writeback.c:916:2: warning: Value stored to 'cur_idx'
      is never read [deadcode.DeadStores]
      
      Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Coly Li <colyli@suse.de>
      Link: https://lore.kernel.org/r/20231120052503.6122-4-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      be93825f
    • Coly Li's avatar
      bcache: check return value from btree_node_alloc_replacement() · 777967e7
      Coly Li authored
      
      
      In btree_gc_rewrite_node(), pointer 'n' is not checked after it is
      returned from btree_node_alloc_replacement(). 'n' may be a non-NULL
      ERR_PTR(), and dereferencing such an error pointer is not permitted
      in the following code. Therefore the return value must be checked
      after 'n' is returned from btree_node_alloc_replacement().
      
      Signed-off-by: Coly Li <colyli@suse.de>
      Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
      Cc: <stable@vger.kernel.org>
      Link: https://lore.kernel.org/r/20231120052503.6122-3-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      777967e7
    • Coly Li's avatar
      bcache: avoid oversize memory allocation by small stripe_size · baf8fb7e
      Coly Li authored
      
      
      Arrays bcache->stripe_sectors_dirty and bcache->full_dirty_stripes are
      used for dirty data writeback; their sizes are determined by the
      backing device capacity and the stripe size. A larger backing device
      capacity or a smaller stripe size makes these two arrays occupy more
      dynamic memory.
      
      Currently bcache->stripe_size is directly inherited from
      queue->limits.io_opt of the underlying storage device. For normal hard
      drives, limits.io_opt is 0, and bcache sets the corresponding
      stripe_size to 1TB (1<<31 sectors); this has worked fine for 10+
      years. But for devices that do declare a value for
      queue->limits.io_opt, a small stripe_size (compared to 1TB) leads to
      oversize memory allocations for bcache->stripe_sectors_dirty and
      bcache->full_dirty_stripes, as hard drive capacity has grown much
      larger in the recent decade.
      
      For example, for a raid5 array assembled from three 20TB hard drives,
      the raid device capacity is 40TB with a typical 512KB limits.io_opt.
      After the calculation in the bcache code, these two arrays will occupy
      400MB of dynamic memory. Even worse, Andrea Tomassetti reports that a
      4KB limits.io_opt is declared on a new 2TB hard drive, so these two
      arrays request 2GB and 512MB of dynamic memory from kzalloc(). The
      result is that the bcache device always fails to initialize on his
      system.
      
      To avoid the oversize memory allocation, bcache->stripe_size should not
      be directly inherited from queue->limits.io_opt of the underlying
      device. This patch defines BCH_MIN_STRIPE_SZ (4MB) as the minimal
      bcache stripe size and sets the bcache device's stripe size against
      the declared limits.io_opt value from the underlying storage device:
      - If the declared limits.io_opt > BCH_MIN_STRIPE_SZ, the bcache device
        will set its stripe size directly to this limits.io_opt value.
      - If the declared limits.io_opt < BCH_MIN_STRIPE_SZ, the bcache device
        will set its stripe size to a multiple of limits.io_opt that is
        equal to or larger than BCH_MIN_STRIPE_SZ.
      
      Then the minimal stripe size of a bcache device will always be >= 4MB.
      For a 40TB raid5 device with 512KB limits.io_opt, memory occupied by
      bcache->stripe_sectors_dirty and bcache->full_dirty_stripes will be 50MB
      in total. For a 2TB hard drive with 4KB limits.io_opt, memory occupied
      by these two arrays will be 2.5MB in total.
      
      Such an amount of memory allocated for bcache->stripe_sectors_dirty and
      bcache->full_dirty_stripes is reasonable for most storage devices.
      
      Reported-by: Andrea Tomassetti <andrea.tomassetti-opensource@devo.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Reviewed-by: Eric Wheeler <bcache@lists.ewheeler.net>
      Link: https://lore.kernel.org/r/20231120052503.6122-2-colyli@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      baf8fb7e
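      The stripe size rule from this commit message can be sketched as
      follows. This is an illustration under stated assumptions, not the
      code from the patch: the names DEMO_MIN_STRIPE_BYTES,
      demo_stripe_size() and demo_nr_stripes() are hypothetical, sizes
      are in bytes rather than sectors, the smallest qualifying multiple
      of io_opt is chosen, and the io_opt == 0 case (where bcache falls
      back to 1<<31 sectors) is not covered.

```c
#include <assert.h>
#include <stdint.h>

/* 4MB minimum, mirroring the BCH_MIN_STRIPE_SZ value named in the
 * commit message (expressed in bytes for this sketch). */
#define DEMO_MIN_STRIPE_BYTES (4ULL << 20)

/* Pick a stripe size for a non-zero io_opt: keep io_opt when it already
 * reaches the minimum, otherwise use the smallest multiple of io_opt
 * that is >= the minimum. */
static uint64_t demo_stripe_size(uint64_t io_opt)
{
	assert(io_opt > 0);
	if (io_opt >= DEMO_MIN_STRIPE_BYTES)
		return io_opt;
	return ((DEMO_MIN_STRIPE_BYTES + io_opt - 1) / io_opt) * io_opt;
}

/* Number of stripes needed to cover the device, rounded up. Both
 * per-stripe arrays grow linearly with this value. */
static uint64_t demo_nr_stripes(uint64_t capacity, uint64_t stripe)
{
	assert(stripe > 0);
	return (capacity + stripe - 1) / stripe;
}
```

      Under these assumptions a 4KB io_opt is raised to a 4MB stripe, so
      a 2TB device is covered by 524288 stripes instead of one stripe
      per 4KB, which is why the per-stripe arrays shrink so sharply.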
  3. Nov 20, 2023
  4. Nov 18, 2023
  5. Nov 13, 2023
    • Christoph Hellwig's avatar
      blk-mq: make sure active queue usage is held for bio_integrity_prep() · b0077e26
      Christoph Hellwig authored
      blk_integrity_unregister() can run while the queue usage counter is
      not held for a bio that already has integrity prepared, so the request
      may be completed by calling profile->complete_fn, which then causes a
      kernel panic.
      
      Another constraint is that bio_integrity_prep() needs to be called
      before bio merge.
      
      Fix the issue by:
      
      - call bio_integrity_prep() with the queue usage counter reliably held
      
      - call bio_integrity_prep() before bio merge
      
      Fixes: 900e0807 ("block: move queue enter logic into blk_mq_submit_bio()")
      Reported-by: Yi Zhang <yi.zhang@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Tested-by: Yi Zhang <yi.zhang@redhat.com>
      Link: https://lore.kernel.org/r/20231113035231.2708053-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b0077e26
    • Linus Torvalds's avatar
      Linux 6.7-rc1 · b85ea95d
      Linus Torvalds authored
      v6.7-rc1
      b85ea95d
    • Miri Korenblit's avatar
      wifi: iwlwifi: fix system commands group ordering · e257da57
      Miri Korenblit authored
      The commands should be sorted inside the group definition.
      Fix the ordering so we won't get the following warning:
      WARN_ON(iwl_cmd_groups_verify_sorted(trans_cfg))
      
      Link: https://lore.kernel.org/regressions/2fa930bb-54dd-4942-a88d-05a47c8e9731@gmail.com/
      Link: https://lore.kernel.org/linux-wireless/CAHk-=wix6kqQ5vHZXjOPpZBfM7mMm9bBZxi2Jh7XnaKCqVf94w@mail.gmail.com/
      Fixes: b6e3d1ba ("wifi: iwlwifi: mvm: implement new firmware API for statistics")
      Tested-by: Niklāvs Koļesņikovs <pinkflames.linux@gmail.com>
      Tested-by: Damian Tometzki <damian@riscv-rocks.de>
      Acked-by: Kalle Valo <kvalo@kernel.org>
      Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
      Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e257da57
    • Linus Torvalds's avatar
      Merge tag 'parisc-for-6.7-rc1-2' of... · b57b17e8
      Linus Torvalds authored
      Merge tag 'parisc-for-6.7-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
      
      Pull parisc architecture fixes from Helge Deller:
      
       - Include the upper 5 address bits when inserting TLB entries on a
         64-bit kernel.
      
         On physical machines those are ignored, but in qemu it's nice to have
         them included and to be correct.
      
       - Stop the 64-bit kernel and show a warning if someone tries to boot on
         a machine with a 32-bit CPU
      
       - Fix a "no previous prototype" warning in parport-gsc
      
      * tag 'parisc-for-6.7-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
        parisc: Prevent booting 64-bit kernels on PA1.x machines
        parport: gsc: mark init function static
        parisc/pgtable: Do not drop upper 5 address bits of physical address
      b57b17e8
    • Linus Torvalds's avatar
      Merge tag 'loongarch-6.7' of... · 4eeee663
      Linus Torvalds authored
      Merge tag 'loongarch-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
      
      Pull LoongArch updates from Huacai Chen:
      
       - support PREEMPT_DYNAMIC with static keys
      
       - relax memory ordering for atomic operations
      
       - support BPF CPU v4 instructions for LoongArch
      
       - some build and runtime warning fixes
      
      * tag 'loongarch-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
        selftests/bpf: Enable cpu v4 tests for LoongArch
        LoongArch: BPF: Support signed mod instructions
        LoongArch: BPF: Support signed div instructions
        LoongArch: BPF: Support 32-bit offset jmp instructions
        LoongArch: BPF: Support unconditional bswap instructions
        LoongArch: BPF: Support sign-extension mov instructions
        LoongArch: BPF: Support sign-extension load instructions
        LoongArch: Add more instruction opcodes and emit_* helpers
        LoongArch/smp: Call rcutree_report_cpu_starting() earlier
        LoongArch: Relax memory ordering for atomic operations
        LoongArch: Mark __percpu functions as always inline
        LoongArch: Disable module from accessing external data directly
        LoongArch: Support PREEMPT_DYNAMIC with static keys
      4eeee663
    • Linus Torvalds's avatar
      Merge tag 'powerpc-6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 5dd2020f
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
      
       - Finish a refactor of pgprot_framebuffer() which depended
         on some changes that were merged via the drm tree
      
       - Fix some kernel-doc warnings to quieten the bots
      
      Thanks to Nathan Lynch and Thomas Zimmermann.
      
      * tag 'powerpc-6.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/rtas: Fix ppc_rtas_rmo_buf_show() kernel-doc
        powerpc/pseries/rtas-work-area: Fix rtas_work_area_reserve_arena() kernel-doc
        powerpc/fb: Call internal __phys_mem_access_prot() in fbdev code
        powerpc: Remove file parameter from phys_mem_access_prot()
        powerpc/machdep: Remove trailing whitespaces
      5dd2020f