  1. Jun 23, 2023
    • Merge tag 'cgroup-for-6.4-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 5950a006
      Linus Torvalds authored
      Pull cgroup fixes from Tejun Heo:
       "It's late but here are two bug fixes. Both fix problems which can be
        severe but are very confined in scope. The risk to most use cases
        should be minimal.
      
         - Fix for an old bug which triggers if a cgroup subsystem is
           remounted to a different hierarchy while someone is reading its
           cgroup.procs/tasks file. The risk is pretty low given how seldom
           cgroup subsystems are moved across hierarchies.
      
         - We moved cpus_read_lock() outside of cgroup internal locks a while
           ago but forgot to update the legacy_freezer leading to lockdep
           triggers. Fixed"
      
      * tag 'cgroup-for-6.4-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cgroup: Do not corrupt task iteration when rebinding subsystem
        cgroup,freezer: hold cpu_hotplug_lock before freezer_mutex in freezer_css_{online,offline}()
  2. Jun 22, 2023
  3. Jun 21, 2023
    • Revert "virtio-blk: support completion batching for the IRQ path" · afd384f0
      Michael S. Tsirkin authored
      This reverts commit 07b679f7.
      
      This change appears to have broken things...
      We now see applications hanging during disk accesses.
      e.g.
      multi-port virtio-blk device running in h/w (FPGA)
      Host running a simple 'fio' test.
      [global]
      thread=1
      direct=1
      ioengine=libaio
      norandommap=1
      group_reporting=1
      bs=4K
      rw=read
      iodepth=128
      runtime=1
      numjobs=4
      time_based
      [job0]
      filename=/dev/vda
      [job1]
      filename=/dev/vdb
      [job2]
      filename=/dev/vdc
      ...
      [job15]
      filename=/dev/vdp
      
      i.e. 16 disks; 4 queues per disk; simple burst of 4KB reads
      This is repeatedly run in a loop.
      
      After a few, normally <10 seconds, fio hangs.
      With 64 queues (16 disks), failure occurs within a few seconds; with 8 queues (2 disks) it may take ~hour before hanging.
      Last message:
      fio-3.19
      Starting 8 threads
      Jobs: 1 (f=1): [_(7),R(1)][68.3%][eta 03h:11m:06s]
      I think this means at the end of the run 1 queue was left incomplete.
      
      'diskstats' (run while fio is hung) shows no outstanding transactions.
      e.g.
      $ cat /proc/diskstats
      ...
      252       0 vda 1843140071 0 14745120568 712568645 0 0 0 0 0 3117947 712568645 0 0 0 0 0 0
      252      16 vdb 1816291511 0 14530332088 704905623 0 0 0 0 0 3117711 704905623 0 0 0 0 0 0
      ...
      
      Other stats (in the h/w, and added to the virtio-blk driver at [a] virtio_queue_rq(), [b] virtblk_handle_req(), [c] virtblk_request_done()) all agree, and show every request had a completion, and that virtblk_request_done() never gets called.
      e.g.
      PF= 0                         vq=0           1           2           3
      [a]request_count     -   839416590   813148916   105586179    84988123
      [b]completion1_count -   839416590   813148916   105586179    84988123
      [c]completion2_count -           0           0           0           0
      
      PF= 1                         vq=0           1           2           3
      [a]request_count     -   823335887   812516140   104582672    75856549
      [b]completion1_count -   823335887   812516140   104582672    75856549
      [c]completion2_count -           0           0           0           0
      
      i.e. the issue is after the virtio-blk driver.
      
      This change was introduced in kernel 6.3.0.
      I am seeing this using 6.3.3.
      If I run with an earlier kernel (5.15), it does not occur.
      If I make a simple patch to the 6.3.3 virtio-blk driver, to skip the blk_mq_add_to_batch() call, it does not fail.
      e.g.
      kernel 5.15 - this is OK
      virtio_blk.c,virtblk_done() [irq handler]
                       if (likely(!blk_should_fake_timeout(req->q))) {
                                blk_mq_complete_request(req);
                       }
      
      kernel 6.3.3 - this fails
      virtio_blk.c,virtblk_handle_req() [irq handler]
                       if (likely(!blk_should_fake_timeout(req->q))) {
                                if (!blk_mq_complete_request_remote(req)) {
                                        if (!blk_mq_add_to_batch(req, iob, virtblk_vbr_status(vbr), virtblk_complete_batch)) {
                                                virtblk_request_done(req);    // this never gets called... so blk_mq_add_to_batch() must always succeed
                                        }
                                }
                       }
      
      If I do, kernel 6.3.3 - this is OK
      virtio_blk.c,virtblk_handle_req() [irq handler]
                       if (likely(!blk_should_fake_timeout(req->q))) {
                                if (!blk_mq_complete_request_remote(req)) {
                                        virtblk_request_done(req);    // force this here...
                                        if (!blk_mq_add_to_batch(req, iob, virtblk_vbr_status(vbr), virtblk_complete_batch)) {
                                                virtblk_request_done(req);    // this never gets called... so blk_mq_add_to_batch() must always succeed
                                        }
                                }
                       }
      
      Perhaps you might like to fix/test/revert this change...
      Martin
      
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202306090826.C1fZmdMe-lkp@intel.com/
      Cc: Suwan Kim <suwan.kim027@gmail.com>
      Tested-by: <edliaw@google.com>
      Reported-by: "Roberts, Martin" <martin.roberts@intel.com>
      Message-Id: <336455b4f630f329380a8f53ee8cad3868764d5c.1686295549.git.mst@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    • Merge tag 'mm-hotfixes-stable-2023-06-20-12-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm · 8ba90f5c
      Linus Torvalds authored
      Pull hotfixes from Andrew Morton:
       "19 hotfixes.  8 of these are cc:stable.
      
        This includes a wholesale reversion of the post-6.4 series 'make slab
        shrink lockless'. After input from Dave Chinner it has been decided
        that we should go a different way [1]"
      
      Link: https://lkml.kernel.org/r/ZH6K0McWBeCjaf16@dread.disaster.area [1]
      
      * tag 'mm-hotfixes-stable-2023-06-20-12-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
        selftests/mm: fix cross compilation with LLVM
        mailmap: add entries for Ben Dooks
        nilfs2: prevent general protection fault in nilfs_clear_dirty_page()
        Revert "mm: vmscan: make global slab shrink lockless"
        Revert "mm: vmscan: make memcg slab shrink lockless"
        Revert "mm: vmscan: add shrinker_srcu_generation"
        Revert "mm: shrinkers: make count and scan in shrinker debugfs lockless"
        Revert "mm: vmscan: hold write lock to reparent shrinker nr_deferred"
        Revert "mm: vmscan: remove shrinker_rwsem from synchronize_shrinkers()"
        Revert "mm: shrinkers: convert shrinker_rwsem to mutex"
        nilfs2: fix buffer corruption due to concurrent device reads
        scripts/gdb: fix SB_* constants parsing
        scripts: fix the gfp flags header path in gfp-translate
        udmabuf: revert 'Add support for mapping hugepages (v4)'
        mm/khugepaged: fix iteration in collapse_file
        memfd: check for non-NULL file_seals in memfd_create() syscall
        mm/vmalloc: do not output a spurious warning when huge vmalloc() fails
        mm/mprotect: fix do_mprotect_pkey() limit check
        writeback: fix dereferencing NULL mapping->host on writeback_page_template
    • Merge tag 'acpi-6.4-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · e660abd5
      Linus Torvalds authored
      Pull ACPI fix from Rafael Wysocki:
       "Fix a kernel crash during early resume from ACPI S3 that has been
        present since the 5.15 cycle when might_sleep() was added to
        down_timeout(), which in some configurations of the kernel caused an
        implicit preemption point to trigger at a wrong time"
      
      * tag 'acpi-6.4-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPI: sleep: Avoid breaking S3 wakeup due to might_sleep()
    • Merge tag 'thermal-6.4-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · c74e2ac2
      Linus Torvalds authored
      Pull thermal control fix from Rafael Wysocki:
       "Fix a regression introduced during the 6.3 cycle causing
        intel_soc_dts_iosf to report incorrect temperature values
        due to a coding mistake (Hans de Goede)"
      
      * tag 'thermal-6.4-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        thermal/intel/intel_soc_dts_iosf: Fix reporting wrong temperatures
    • Merge tag 'trace-v6.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · 2e30b973
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
      
       - Fix MAINTAINERS file to point to proper mailing list for rtla and rv
      
         The mailing list pointed to linux-trace-devel instead of
         linux-trace-kernel. The former is for the tracing libraries and the
         latter is for anything in the Linux kernel tree. The wrong mailing
         list was used because linux-trace-kernel did not exist when rtla and
         rv were created.
      
       - User events:
      
          - Fix matching of dynamic events to their user events
      
            When a user writes to the dynamic_events file, a lookup of the
            registered dynamic events is made, but there were some cases in
            which a match could be incorrectly made.
      
          - Add auto cleanup of user events
      
            Have the user events automatically get removed when the last
            reference (file descriptor) is closed. This was asked for to
            prevent leaks of user events hanging around needing admins to
            clean them up.
      
          - Add persistent logic (but not let user space use it yet)
      
            In some cases, having a persistent user event (one that does not
            get cleaned up automatically) is useful. But there's still debates
            about how to expose this to user space. The infrastructure is
            added, but the API is not.
      
          - Update the selftests
      
            Update the user event selftests to reflect the above changes"
      
      * tag 'trace-v6.4-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        tracing/user_events: Document auto-cleanup and remove dyn_event refs
        selftests/user_events: Adapt dyn_test to non-persist events
        selftests/user_events: Ensure auto cleanup works as expected
        tracing/user_events: Add auto cleanup and future persist flag
        tracing/user_events: Track refcount consistently via put/get
        tracing/user_events: Store register flags on events
        tracing/user_events: Remove user_ns walk for groups
        selftests/user_events: Add perf self-test for empty arguments events
        selftests/user_events: Clear the events after perf self-test
        selftests/user_events: Add ftrace self-test for empty arguments events
        tracing/user_events: Fix the incorrect trace record for empty arguments events
        tracing: Modify print_fields() for fields output order
        tracing/user_events: Handle matching arguments that is null from dyn_events
        tracing/user_events: Prevent same name but different args event
        tracing/rv/rtla: Update MAINTAINERS file to point to proper mailing list
    • Merge tag 'for-6.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 4b0c7a1b
      Linus Torvalds authored
      Pull btrfs fix from David Sterba:
       "One more regression fix for an assertion failure that uncovered a
        nasty problem with stripe calculations. This is caused by a u32
        overflow when there are enough devices. The fstests require 6 so this
        hasn't been caught, I was able to hit it with 8.
      
        The fix is minimal and only adds u64 casts, we'll clean that up later.
        I did various additional tests to be sure"
      
      * tag 'for-6.4-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: fix u32 overflows when left shifting stripe_nr
    • regmap: spi-avmm: Fix regmap_bus max_raw_write · c8e79689
      Russ Weight authored
      The max_raw_write member of the regmap_spi_avmm_bus structure is defined
      as:
      	.max_raw_write = SPI_AVMM_VAL_SIZE * MAX_WRITE_CNT
      
      SPI_AVMM_VAL_SIZE == 4 and MAX_WRITE_CNT == 1 so this results in a
      maximum write transfer size of 4 bytes which provides only enough space to
      transfer the address of the target register. It provides no space for the
      value to be transferred. This bug became an issue (divide-by-zero in
      _regmap_raw_write()) after the following was accepted into mainline:
      
      commit 39815141 ("regmap: Account for register length when chunking")
      
      Change max_raw_write to include space (4 additional bytes) for both the
      register address and value:
      
      	.max_raw_write = SPI_AVMM_REG_SIZE + SPI_AVMM_VAL_SIZE * MAX_WRITE_CNT
      
      Fixes: 7f9fb673 ("regmap: add Intel SPI Slave to AVMM Bus Bridge support")
      Reviewed-by: Matthew Gerlach <matthew.gerlach@linux.intel.com>
      Signed-off-by: Russ Weight <russell.h.weight@intel.com>
      Link: https://lore.kernel.org/r/20230620202824.380313-1-russell.h.weight@intel.com
      Signed-off-by: Mark Brown <broonie@kernel.org>
    • Merge tag '6.4-rc6-smb3-server-fixes' of git://git.samba.org/ksmbd · 99ec1ed7
      Linus Torvalds authored
      Pull smb server fixes from Steve French:
       "Four smb3 server fixes, all also for stable:
      
         - fix potential oops in parsing compounded requests
      
         - fix various paths (mkdir, create etc) where mnt_want_write was not
           checked first
      
         - fix slab out of bounds in check_message and write"
      
      * tag '6.4-rc6-smb3-server-fixes' of git://git.samba.org/ksmbd:
        ksmbd: validate session id and tree id in the compound request
        ksmbd: fix out-of-bound read in smb2_write
        ksmbd: add mnt_want_write to ksmbd vfs functions
        ksmbd: validate command payload size
    • btrfs: fix u32 overflows when left shifting stripe_nr · a7299a18
      Qu Wenruo authored
      [BUG]
      David reported an ASSERT() getting triggered during fio load on 8 devices
      with data/raid6 and metadata/raid1c3:
      
        fio --rw=randrw --randrepeat=1 --size=3000m \
      	  --bsrange=512b-64k --bs_unaligned \
      	  --ioengine=libaio --fsync=1024 \
      	  --name=job0 --name=job1 \
      
      The ASSERT() is from rbio_add_bio() of raid56.c:
      
      	ASSERT(orig_logical >= full_stripe_start &&
      	       orig_logical + orig_len <= full_stripe_start +
      	       rbio->nr_data * BTRFS_STRIPE_LEN);
      
      Which is checking if the target rbio is crossing the full stripe
      boundary.
      
        [100.789] assertion failed: orig_logical >= full_stripe_start && orig_logical + orig_len <= full_stripe_start + rbio->nr_data * BTRFS_STRIPE_LEN, in fs/btrfs/raid56.c:1622
        [100.795] ------------[ cut here ]------------
        [100.796] kernel BUG at fs/btrfs/raid56.c:1622!
        [100.797] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
        [100.798] CPU: 1 PID: 100 Comm: kworker/u8:4 Not tainted 6.4.0-rc6-default+ #124
        [100.799] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
        [100.802] Workqueue: writeback wb_workfn (flush-btrfs-1)
        [100.803] RIP: 0010:rbio_add_bio+0x204/0x210 [btrfs]
        [100.806] RSP: 0018:ffff888104a8f300 EFLAGS: 00010246
        [100.808] RAX: 00000000000000a1 RBX: ffff8881075907e0 RCX: ffffed1020951e01
        [100.809] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 0000000000000001
        [100.811] RBP: 0000000141d20000 R08: 0000000000000001 R09: ffff888104a8f04f
        [100.813] R10: ffffed1020951e09 R11: 0000000000000003 R12: ffff88810e87f400
        [100.815] R13: 0000000041d20000 R14: 0000000144529000 R15: ffff888101524000
        [100.817] FS:  0000000000000000(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000
        [100.821] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [100.822] CR2: 000055d54e44c270 CR3: 000000010a9a1006 CR4: 00000000003706a0
        [100.824] Call Trace:
        [100.825]  <TASK>
        [100.825]  ? die+0x32/0x80
        [100.826]  ? do_trap+0x12d/0x160
        [100.827]  ? rbio_add_bio+0x204/0x210 [btrfs]
        [100.827]  ? rbio_add_bio+0x204/0x210 [btrfs]
        [100.829]  ? do_error_trap+0x90/0x130
        [100.830]  ? rbio_add_bio+0x204/0x210 [btrfs]
        [100.831]  ? handle_invalid_op+0x2c/0x30
        [100.833]  ? rbio_add_bio+0x204/0x210 [btrfs]
        [100.835]  ? exc_invalid_op+0x29/0x40
        [100.836]  ? asm_exc_invalid_op+0x16/0x20
        [100.837]  ? rbio_add_bio+0x204/0x210 [btrfs]
        [100.837]  raid56_parity_write+0x64/0x270 [btrfs]
        [100.838]  btrfs_submit_chunk+0x26e/0x800 [btrfs]
        [100.840]  ? btrfs_bio_init+0x80/0x80 [btrfs]
        [100.841]  ? release_pages+0x503/0x6d0
        [100.842]  ? folio_unlock+0x2f/0x60
        [100.844]  ? __folio_put+0x60/0x60
        [100.845]  ? btrfs_do_readpage+0xae0/0xae0 [btrfs]
        [100.847]  btrfs_submit_bio+0x21/0x60 [btrfs]
        [100.847]  submit_one_bio+0x6a/0xb0 [btrfs]
        [100.849]  extent_write_cache_pages+0x395/0x680 [btrfs]
        [100.850]  ? __extent_writepage+0x520/0x520 [btrfs]
        [100.851]  ? mark_usage+0x190/0x190
        [100.852]  extent_writepages+0xdb/0x130 [btrfs]
        [100.853]  ? extent_write_locked_range+0x480/0x480 [btrfs]
        [100.854]  ? mark_usage+0x190/0x190
        [100.854]  ? attach_extent_buffer_page+0x220/0x220 [btrfs]
        [100.855]  ? reacquire_held_locks+0x178/0x280
        [100.856]  ? writeback_sb_inodes+0x245/0x7f0
        [100.857]  do_writepages+0x102/0x2e0
        [100.858]  ? page_writeback_cpu_online+0x10/0x10
        [100.859]  ? __lock_release.isra.0+0x14a/0x4d0
        [100.860]  ? reacquire_held_locks+0x280/0x280
        [100.861]  ? __lock_acquired+0x1e9/0x3d0
        [100.862]  ? do_raw_spin_lock+0x1b0/0x1b0
        [100.863]  __writeback_single_inode+0x94/0x450
        [100.864]  writeback_sb_inodes+0x372/0x7f0
        [100.864]  ? lock_sync+0xd0/0xd0
        [100.865]  ? do_raw_spin_unlock+0x93/0xf0
        [100.866]  ? sync_inode_metadata+0xc0/0xc0
        [100.867]  ? rwsem_optimistic_spin+0x340/0x340
        [100.868]  __writeback_inodes_wb+0x70/0x130
        [100.869]  wb_writeback+0x2d1/0x530
        [100.869]  ? __writeback_inodes_wb+0x130/0x130
        [100.870]  ? lockdep_hardirqs_on_prepare.part.0+0xf1/0x1c0
        [100.870]  wb_do_writeback+0x3eb/0x480
        [100.871]  ? wb_writeback+0x530/0x530
        [100.871]  ? mark_lock_irq+0xcd0/0xcd0
        [100.871]  wb_workfn+0xe0/0x3f0
      
      [CAUSE]
      Commit a97699d1 ("btrfs: replace map_lookup->stripe_len by
      BTRFS_STRIPE_LEN") changes how we calculate the map length, to reduce
      u64 division.
      
      Function btrfs_max_io_len() is to get the length to the stripe boundary.
      
      It calculates the full stripe start offset (inside the chunk) by the
      following code:
      
      		*full_stripe_start =
      			rounddown(*stripe_nr, nr_data_stripes(map)) <<
      			BTRFS_STRIPE_LEN_SHIFT;
      
      The calculation itself is fine, but the value returned by rounddown() is
      dependent on both @stripe_nr (which is u32) and nr_data_stripes() (which
      returns int).

      Thus the result is also u32, then we do the left shift, which can
      overflow u32.

      If such an overflow happens, @full_stripe_start will be a value way
      smaller than @offset, causing the later "full_stripe_len - (offset -
      *full_stripe_start)" to underflow. The later length calculation then has
      no stripe boundary limit, resulting in a write bio that exceeds the
      stripe boundary.

      There are some other locations like this, where a u32 @stripe_nr is left
      shifted, which can lead to a similar overflow.
      
      [FIX]
      Fix all left shifts of @stripe_nr by adding a type cast to u64 before the
      left shift.

      The involved @stripe_nr or similar variables record the stripe
      number inside the chunk, which is small enough to fit in u32,
      but their offset inside the chunk can not fit into u32.

      Thus for those specific left shifts, a type cast to u64 is necessary.
      To keep the fix minimal, this patch only adds the casts; the code will
      be cleaned up in the future.
      
      Reported-by: David Sterba <dsterba@suse.com>
      Fixes: a97699d1 ("btrfs: replace map_lookup->stripe_len by BTRFS_STRIPE_LEN")
      Tested-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. Jun 20, 2023
  5. Jun 19, 2023