  1. Apr 10, 2021
  2. Apr 09, 2021
    • ext4: make prefetch_block_bitmaps default · 21175ca4
      Harshad Shirwadkar authored
      
      
      Block bitmap prefetching is needed for these allocator optimization
      data structures to get populated and provide better group scanning
      order. So, turn it on by default. The prefetch_block_bitmaps mount option
      is now marked as removed, and a new option, no_prefetch_block_bitmaps, is
      added to disable block bitmap prefetching.
      
      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Link: https://lore.kernel.org/r/20210401172129.189766-8-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: add proc files to monitor new structures · f68f4063
      Harshad Shirwadkar authored
      
      
      This patch adds a new file, "mb_structs_summary", which shows a
      summary of the new allocator structures added in this series. Here
      is sample output from the file:
      
      optimize_scan: 1
      max_free_order_lists:
              list_order_0_groups: 0
              list_order_1_groups: 0
              list_order_2_groups: 0
              list_order_3_groups: 0
              list_order_4_groups: 0
              list_order_5_groups: 0
              list_order_6_groups: 0
              list_order_7_groups: 0
              list_order_8_groups: 0
              list_order_9_groups: 0
              list_order_10_groups: 0
              list_order_11_groups: 0
              list_order_12_groups: 0
              list_order_13_groups: 40
      fragment_size_tree:
              tree_min: 16384
              tree_max: 32768
              tree_nodes: 40
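
A minimal sketch of how the sample output above could be read by a script. It assumes the file keeps the two-level "key: value" layout shown in the sample; the function name parse_mb_structs_summary and the abridged SAMPLE string are illustrative and not part of the patch.

```python
def parse_mb_structs_summary(text):
    """Parse the two-level "key: value" layout shown in the sample output."""
    result = {}
    section = None
    for raw in text.splitlines():
        if not raw.strip():
            continue
        key, _, value = raw.strip().partition(":")
        value = value.strip()
        indented = raw[0] in " \t"
        if value == "":
            # A bare "name:" line opens a new section.
            section = result.setdefault(key, {})
        elif indented and section is not None:
            # Indented lines belong to the most recently opened section.
            section[key] = int(value)
        else:
            result[key] = int(value)
            section = None
    return result


# Abridged copy of the sample output (only two of the order lists kept).
SAMPLE = """\
optimize_scan: 1
max_free_order_lists:
        list_order_0_groups: 0
        list_order_13_groups: 40
fragment_size_tree:
        tree_min: 16384
        tree_max: 32768
        tree_nodes: 40
"""

summary = parse_mb_structs_summary(SAMPLE)
groups_on_lists = sum(summary["max_free_order_lists"].values())
```

In the sample, the number of groups on the order lists equals tree_nodes, so a script could compare the two counts as a quick consistency check.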
      
      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20210401172129.189766-7-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: improve cr 0 / cr 1 group scanning · 196e402a
      Harshad Shirwadkar authored
      
      
      Instead of traversing through groups linearly, scan groups in specific
      orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
      largest free order >= the order of the request. So, with this patch,
      we maintain lists for each possible order and insert each group into a
      list based on the largest free order in its buddy bitmap. During cr 0
      allocation, we traverse these lists in the increasing order of largest
      free orders. This allows us to find a group with the best available cr
      0 match in constant time. If nothing can be found, we fall back to cr 1
      immediately.
      
      At cr 1, the story is slightly different. We want to traverse in the
      order of increasing average fragment size. For cr 1, we maintain an rb
      tree of groupinfos which is sorted by average fragment size. Instead
      of traversing linearly, at cr 1, we traverse in the order of increasing
      average fragment size, starting at the most optimal group. This brings
      down cr 1 search complexity to log(num groups).
      
      For cr >= 2, we just perform the linear search as before. Also, in
      case of lock contention, we intermittently fall back to linear search
      even in the cr 0 and cr 1 cases. This allows us to proceed during the
      allocation path even in case of high contention.
      
      There is an opportunity to do optimization at cr 2 too. That's because
      at cr 2 we only consider groups where the bb_free counter (number of
      free blocks) is greater than the request extent size. That's left as
      future work.
      
      All the changes introduced in this patch are protected under a new
      mount option "mb_optimize_scan".
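
The two lookups described in this commit message can be modelled with a toy sketch. This is not the kernel code: the Group and ScanIndex names are hypothetical, a sorted Python list searched with bisect stands in for the rb tree, and "most optimal group" is assumed to mean the group with the smallest average fragment size that is still at least the request length.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Group:
    number: int
    largest_free_order: int  # largest order with a free chunk in the buddy bitmap
    free_blocks: int
    fragments: int           # number of free extents in the group

    @property
    def avg_fragment_size(self):
        return self.free_blocks // self.fragments if self.fragments else 0


class ScanIndex:
    def __init__(self, groups, num_orders):
        # cr 0: one list per possible order, keyed by each group's largest free order.
        self.order_lists = [[] for _ in range(num_orders)]
        for g in groups:
            self.order_lists[g.largest_free_order].append(g)
        # cr 1: groups sorted by average fragment size (stand-in for the rb tree).
        self.by_frag = sorted(groups, key=lambda g: g.avg_fragment_size)
        self.frag_keys = [g.avg_fragment_size for g in self.by_frag]

    def pick_cr0(self, request_order):
        # Walk the lists in increasing order, starting at the request's order.
        # The loop is bounded by the number of orders, not the number of groups.
        for order in range(request_order, len(self.order_lists)):
            if self.order_lists[order]:
                return self.order_lists[order][0]
        return None  # caller would fall back to cr 1

    def pick_cr1(self, request_len):
        # Binary search for the first group whose average fragment size
        # is at least the request length (assumed meaning of "most optimal").
        i = bisect.bisect_left(self.frag_keys, request_len)
        return self.by_frag[i] if i < len(self.by_frag) else None


groups = [
    Group(0, largest_free_order=3, free_blocks=100, fragments=20),
    Group(1, largest_free_order=5, free_blocks=200, fragments=10),
    Group(2, largest_free_order=9, free_blocks=900, fragments=3),
]
index = ScanIndex(groups, num_orders=14)
cr0_choice = index.pick_cr0(4)    # order 4 list is empty, order 5 holds group 1
cr1_choice = index.pick_cr1(16)   # first average fragment size >= 16 is group 1's
```

The sketch leaves out locking, list maintenance on allocation and free, and the intermittent linear fallback under contention that the message mentions.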
      
      With this patchset, the following experiment was performed:
      
      A highly fragmented 65TB disk was created, with no contiguous 2M
      regions. The following command was then run three times
      consecutively:
      
      time dd if=/dev/urandom of=file bs=2M count=10
      
      Here are the results with and without cr 0/1 optimizations introduced
      in this patch:
      
      |---------+------------------------------+---------------------------|
      |         | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
      |---------+------------------------------+---------------------------|
      | 1st run | 5m1.871s                     | 2m47.642s                 |
      | 2nd run | 2m28.390s                    | 0m0.611s                  |
      | 3rd run | 2m26.530s                    | 0m1.255s                  |
      |---------+------------------------------+---------------------------|
      
      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Link: https://lore.kernel.org/r/20210401172129.189766-6-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: add MB_NUM_ORDERS macro · 4b68f6df
      Harshad Shirwadkar authored
      
      
      A few arrays in mballoc.c use the total number of valid orders as
      their size. Currently, this value is written as "sb->s_blocksize_bits +
      2", which makes the code harder to read. So, add a new macro,
      MB_NUM_ORDERS(sb), to make the code more readable.
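
As a worked example of the expression quoted above, the sketch below evaluates it for a 4096-byte block size. The Python function mb_num_orders is only an illustration of the arithmetic; the real macro operates on a superblock in C.

```python
def mb_num_orders(blocksize_bits):
    # Mirrors the expression quoted in the commit message: s_blocksize_bits + 2.
    return blocksize_bits + 2


blocksize = 4096
blocksize_bits = blocksize.bit_length() - 1  # 12 for 4096-byte blocks
num_orders = mb_num_orders(blocksize_bits)   # 14 valid orders, numbered 0..13
```

Fourteen orders would match the list_order_0 through list_order_13 entries in the mb_structs_summary sample from commit f68f4063, if that filesystem used 4096-byte blocks.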
      
      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20210401172129.189766-5-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: add mballoc stats proc file · a6c75eaf
      Harshad Shirwadkar authored
      
      
      Add new stats for measuring the performance of mballoc. This patch is
      forked from Artem Blagodarenko's work that can be found here:
      
      https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel8/ext4-simple-blockalloc.patch
      
      This patch reorganizes the stats by cr level. This is what the output
      looks like:
      
      mballoc:
      	reqs: 0
      	success: 0
      	groups_scanned: 0
      	cr0_stats:
      		hits: 0
      		groups_considered: 0
      		useless_loops: 0
      		bad_suggestions: 0
      	cr1_stats:
      		hits: 0
      		groups_considered: 0
      		useless_loops: 0
      		bad_suggestions: 0
      	cr2_stats:
      		hits: 0
      		groups_considered: 0
      		useless_loops: 0
      	cr3_stats:
      		hits: 0
      		groups_considered: 0
      		useless_loops: 0
      	extents_scanned: 0
      		goal_hits: 0
      		2^n_hits: 0
      		breaks: 0
      		lost: 0
      	buddies_generated: 0/40
      	buddies_time_used: 0
      	preallocated: 0
      	discarded: 0
      
      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20210401172129.189766-4-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: add ability to return parsed options from parse_options · b237e304
      Harshad Shirwadkar authored
      
      
      Before this patch, the function parse_options() returned only the
      journal_devnum and journal_ioprio variables to the caller. This patch
      generalizes that interface to allow parse_options() to return any
      parsed options to the caller. In this patch series, it is used to
      capture the value of the "mb_optimize_scan=%u" mount option.
      
      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20210401172129.189766-3-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: drop s_mb_bal_lock and convert protected fields to atomic · 67d25186
      Harshad Shirwadkar authored
      
      
      s_mb_buddies_generated gets used later in this patch series to
      determine if the cr 0 and cr 1 optimizations should be performed or
      not. Currently, s_mb_buddies_generated is protected under a
      spin_lock. In the allocation path, it is better if we don't depend on
      the lock and instead read the value atomically. In order to do that,
      we drop s_bal_lock altogether and convert the only two fields it
      protects, s_mb_buddies_generated and s_mb_generation_time, to atomic
      type.
      
      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20210401172129.189766-2-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix check to prevent false positive report of incorrect used inodes · a149d2a5
      Zhang Yi authored
      Commit 50122847 ("ext4: fix check to prevent initializing reserved
      inodes") checks block group zero and prevents initializing reserved
      inodes. But in some special cases, the reserved inodes may not all
      belong to group zero; they may extend into the second group if we
      format the filesystem as below.
      
        mkfs.ext4 -b 4096 -g 8192 -N 1024 -I 4096 /dev/sda
      
      So, it will end up triggering a false positive report of a corrupted
      file system. This patch fixes it by skipping the reserved inode check
      if no free inode blocks will be zeroed.
      
      Cc: stable@kernel.org
      Fixes: 50122847 ("ext4: fix check to prevent initializing reserved inodes")
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20210331121516.2243099-1-yi.zhang@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  3. Apr 06, 2021
    • jbd2: avoid -Wempty-body warnings · d5564351
      Arnd Bergmann authored
      
      
      Building with 'make W=1' shows a harmless -Wempty-body warning:
      
      fs/jbd2/recovery.c: In function 'fc_do_one_pass':
      fs/jbd2/recovery.c:267:75: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body]
        267 |                 jbd_debug(3, "Fast commit replay failed, err = %d\n", err);
            |                                                                           ^
      
      Change the empty dprintk() macros to no_printk(), which avoids this
      warning and adds format string checking.
      
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20210322102152.95684-1-arnd@kernel.org
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: optimize match for casefolded encrypted dirs · 1ae98e29
      Daniel Rosenberg authored
      
      
      Matching names in casefolded encrypted directories requires
      decrypting entries to confirm case, since we are case preserving. We
      can avoid the decryption if the hash values don't match.
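
The idea of rejecting on a hash mismatch before decrypting can be illustrated with a toy model. Everything here is a stand-in: the Directory class is hypothetical, keyed BLAKE2b replaces the siphash used by ext4, and the XOR "cipher" is for demonstration only and offers no security.

```python
import hashlib


def name_hash(name, key):
    # Keyed hash over the casefolded name (stand-in for siphash).
    return hashlib.blake2b(name.casefold().encode(), key=key, digest_size=8).digest()


def toy_encrypt(name, key):
    # Toy XOR transform, NOT real encryption.
    data = name.encode()
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def toy_decrypt(blob, key):
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(blob)).decode()


class Directory:
    def __init__(self, key):
        self.key = key
        self.entries = []       # (stored_hash, encrypted_name) pairs
        self.decrypt_calls = 0  # counts how often the expensive path runs

    def add(self, name):
        self.entries.append((name_hash(name, self.key), toy_encrypt(name, self.key)))

    def lookup(self, wanted):
        wanted_hash = name_hash(wanted, self.key)
        for stored_hash, blob in self.entries:
            if stored_hash != wanted_hash:
                continue  # cheap reject: no decryption needed
            self.decrypt_calls += 1
            name = toy_decrypt(blob, self.key)
            if name.casefold() == wanted.casefold():
                return name  # the case-preserved stored name
        return None


d = Directory(b"k" * 16)
for entry in ("Readme.TXT", "notes.md", "Makefile"):
    d.add(entry)
found = d.lookup("README.txt")
```

With three entries in the directory, the lookup decrypts only the one entry whose stored hash matches.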
      
      Signed-off-by: Daniel Rosenberg <drosen@google.com>
      Link: https://lore.kernel.org/r/20210319073414.1381041-3-drosen@google.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: handle casefolding with encryption · 471fbbea
      Daniel Rosenberg authored
      
      
      This adds support for encryption with casefolding.
      
      Since the name on disk is case preserving, and also encrypted, we can no
      longer just recompute the hash on the fly. Additionally, to avoid
      leaking extra information from the hash of the unencrypted name, we use
      siphash via an fscrypt v2 policy.
      
      The hash is stored at the end of the directory entry for all entries
      inside of an encrypted and casefolded directory apart from those that
      deal with '.' and '..'. This way, the change is backwards compatible
      with existing ext4 filesystems.
      
      [ Changed to advertise this feature via the file:
        /sys/fs/ext4/features/encrypted_casefold -- TYT ]
      
      Signed-off-by: Daniel Rosenberg <drosen@google.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Link: https://lore.kernel.org/r/20210319073414.1381041-2-drosen@google.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  4. Apr 03, 2021
  5. Apr 02, 2021
  6. Mar 25, 2021
  7. Mar 22, 2021
    • Linux 5.12-rc4 · 0d02ec6b
      Linus Torvalds authored
      v5.12-rc4
    • Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · d7f5f1bd
      Linus Torvalds authored
      Pull ext4 fixes from Ted Ts'o:
       "Miscellaneous ext4 bug fixes for v5.12"
      
      * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        ext4: initialize ret to suppress smatch warning
        ext4: stop inode update before return
        ext4: fix rename whiteout with fast commit
        ext4: fix timer use-after-free on failed mount
        ext4: fix potential error in ext4_do_update_inode
        ext4: do not try to set xattr into ea_inode if value is empty
        ext4: do not iput inode under running transaction in ext4_rename()
        ext4: find old entry again if failed to rename whiteout
        ext4: fix error handling in ext4_end_enable_verity()
        ext4: fix bh ref count on error paths
        fs/ext4: fix integer overflow in s_log_groups_per_flex
        ext4: add reclaim checks to xattr code
        ext4: shrink race window in ext4_should_retry_alloc()
    • Merge tag 'io_uring-5.12-2021-03-21' of git://git.kernel.dk/linux-block · 2c41fab1
      Linus Torvalds authored
      Pull io_uring followup fixes from Jens Axboe:
      
       - The SIGSTOP change from Eric, so we properly ignore that for
         PF_IO_WORKER threads.
      
       - Disallow sending signals to PF_IO_WORKER threads in general, we're
         not interested in having them funnel back to the io_uring owning
         task.
      
       - Stable fix from Stefan, ensuring we properly break links for short
         send/sendmsg recv/recvmsg if MSG_WAITALL is set.
      
       - Catch and loop when needing to run task_work before a PF_IO_WORKER
         thread goes to sleep.
      
      * tag 'io_uring-5.12-2021-03-21' of git://git.kernel.dk/linux-block:
        io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL
        io-wq: ensure task is running before processing task_work
        signal: don't allow STOP on PF_IO_WORKER threads
        signal: don't allow sending any signals to PF_IO_WORKER threads
    • Merge tag 'staging-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 1d4345eb
      Linus Torvalds authored
      Pull staging and IIO driver fixes from Greg KH:
       "Some small staging and IIO driver fixes:
      
         - MAINTAINERS changes for the move of the staging mailing list
      
         - comedi driver fixes to get request_irq() to work correctly
      
         - counter driver fixes for reported issues with iio devices
      
         - tiny iio driver fixes for reported issues.
      
        All of these have been in linux-next with no reported problems"
      
      * tag 'staging-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        staging: vt665x: fix alignment constraints
        staging: comedi: cb_pcidas64: fix request_irq() warn
        staging: comedi: cb_pcidas: fix request_irq() warn
        MAINTAINERS: move the staging subsystem to lists.linux.dev
        MAINTAINERS: move some real subsystems off of the staging mailing list
        iio: gyro: mpu3050: Fix error handling in mpu3050_trigger_handler
        iio: hid-sensor-temperature: Fix issues of timestamp channel
        iio: hid-sensor-humidity: Fix alignment issue of timestamp channel
        counter: stm32-timer-cnt: fix ceiling miss-alignment with reload register
        counter: stm32-timer-cnt: fix ceiling write max value
        counter: stm32-timer-cnt: Report count function when SLAVE_MODE_DISABLED
        iio: adc: ab8500-gpadc: Fix off by 10 to 3
        iio:adc:stm32-adc: Add HAS_IOMEM dependency
        iio: adis16400: Fix an error code in adis16400_initial_setup()
        iio: adc: adi-axi-adc: add proper Kconfig dependencies
        iio: adc: ad7949: fix wrong ADC result due to incorrect bit mask
        iio: hid-sensor-prox: Fix scale not correct issue
        iio:adc:qcom-spmi-vadc: add default scale to LR_MUX2_BAT_ID channel
    • Merge tag 'usb-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 3001c355
      Linus Torvalds authored
      Pull USB and Thunderbolt driver fixes from Greg KH:
       "Here are some small Thunderbolt and USB driver fixes for some reported
        issues:
      
         - thunderbolt fixes for minor problems
      
         - typec fixes for power issues
      
         - usb-storage quirk addition
      
         - usbip bugfix
      
         - dwc3 bugfix when stopping transfers
      
         - cdnsp bugfix for isoc transfers
      
         - gadget use-after-free fix
      
        All have been in linux-next this week with no reported issues"
      
      * tag 'usb-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        usb: typec: tcpm: Skip sink_cap query only when VDM sm is busy
        usb: dwc3: gadget: Prevent EP queuing while stopping transfers
        usb: typec: tcpm: Invoke power_supply_changed for tcpm-source-psy-
        usb: typec: Remove vdo[3] part of tps6598x_rx_identity_reg struct
        usb-storage: Add quirk to defeat Kindle's automatic unload
        usb: gadget: configfs: Fix KASAN use-after-free
        usbip: Fix incorrect double assignment to udc->ud.tcp_rx
        usb: cdnsp: Fixes incorrect value in ISOC TRB
        thunderbolt: Increase runtime PM reference count on DP tunnel discovery
        thunderbolt: Initialize HopID IDAs in tb_switch_alloc()
    • Merge tag 'irq-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5ee96fa9
      Linus Torvalds authored
      Pull irq fix from Ingo Molnar:
       "A change to robustify force-threaded IRQ handlers to always disable
        interrupts, plus a DocBook fix.
      
        The force-threaded IRQ handler change has been accelerated from the
        normal schedule of such a change to keep the bad pattern/workaround of
        spin_lock_irqsave() in handlers or IRQF_NOTHREAD as a kludge from
        spreading"
      
      * tag 'irq-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        genirq: Disable interrupts for force threaded handlers
        genirq/irq_sim: Fix typos in kernel doc (fnode -> fwnode)
    • Merge tag 'perf-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1c74516c
      Linus Torvalds authored
      Pull perf fixes from Ingo Molnar:
       "Boundary condition fixes for bugs unearthed by the perf fuzzer"
      
      * tag 'perf-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/intel: Fix unchecked MSR access error caused by VLBR_EVENT
        perf/x86/intel: Fix a crash caused by zero PEBS status
    • Merge tag 'locking-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5ba33b48
      Linus Torvalds authored
      Pull locking fixes from Ingo Molnar:
      
       - Get static calls & modules right. Hopefully.
      
       - WW mutex fixes
      
      * tag 'locking-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        static_call: Fix static_call_update() sanity check
        static_call: Align static_call_is_init() patching condition
        static_call: Fix static_call_set_init()
        locking/ww_mutex: Fix acquire/release imbalance in ww_acquire_init()/ww_acquire_fini()
        locking/ww_mutex: Simplify use_ww_ctx & ww_ctx handling
    • Merge tag 'efi-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 92ed88cb
      Linus Torvalds authored
      Pull EFI fixes from Ingo Molnar:
      
       - another missing RT_PROP table related fix, to ensure that the
         efivarfs pseudo filesystem fails gracefully if variable services
         are unsupported
      
       - use the correct alignment for literal EFI GUIDs
      
       - fix a use after unmap issue in the memreserve code
      
      * tag 'efi-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        efi: use 32-bit alignment for efi_guid_t literals
        firmware/efi: Fix a use after bug in efi_mem_reserve_persistent
        efivars: respect EFI_UNSUPPORTED return from firmware
    • Merge tag 'x86_urgent_for_v5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5e3ddf96
      Linus Torvalds authored
      Pull x86 fixes from Borislav Petkov:
       "The freshest pile of shiny x86 fixes for 5.12:
      
         - Add the arch-specific mapping between physical and logical CPUs to
           fix devicetree-node lookups
      
         - Restore the IRQ2 ignore logic
      
         - Fix get_nr_restart_syscall() to return the correct restart syscall
           number. Split in a 4-patches set to avoid kABI breakage when
           backporting to dead kernels"
      
      * tag 'x86_urgent_for_v5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/apic/of: Fix CPU devicetree-node lookups
        x86/ioapic: Ignore IRQ2 again
        x86: Introduce restart_block->arch_data to remove TS_COMPAT_RESTART
        x86: Introduce TS_COMPAT_RESTART to fix get_nr_restart_syscall()
        x86: Move TS_COMPAT back to asm/thread_info.h
        kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data()
    • Merge tag 'powerpc-5.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · b35660a7
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
      
       - Fix a possible stack corruption and subsequent DLPAR failure in the
         rpadlpar_io PCI hotplug driver
      
       - Two build fixes for uncommon configurations
      
      Thanks to Christophe Leroy and Tyrel Datwyler.
      
      * tag 'powerpc-5.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        PCI: rpadlpar: Fix potential drc_name corruption in store functions
        powerpc: Force inlining of cpu_has_feature() to avoid build failure
        powerpc/vdso32: Add missing _restgpr_31_x to fix build failure
  8. Mar 21, 2021
    • io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL · 0031275d
      Stefan Metzmacher authored
      
      
      Without that it's not safe to use them in a linked combination with
      others.
      
      Now combinations like IORING_OP_SENDMSG followed by IORING_OP_SPLICE
      should be possible.
      
      We already handle short reads and writes for the following opcodes:
      
      - IORING_OP_READV
      - IORING_OP_READ_FIXED
      - IORING_OP_READ
      - IORING_OP_WRITEV
      - IORING_OP_WRITE_FIXED
      - IORING_OP_WRITE
      - IORING_OP_SPLICE
      - IORING_OP_TEE
      
      Now we have it for these as well:
      
      - IORING_OP_SENDMSG
      - IORING_OP_SEND
      - IORING_OP_RECVMSG
      - IORING_OP_RECV
      
      For IORING_OP_RECVMSG we also check for the MSG_TRUNC and MSG_CTRUNC
      flags in order to call req_set_fail_links().
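
The effect on a linked chain can be pictured with a small simulation. This is a toy model, not io_uring code: the Op class and run_chain function are hypothetical, every op in the list is treated as linked to the next one, and the "canceled" marker stands in for whatever completion the kernel reports for the skipped requests.

```python
from dataclasses import dataclass

MSG_WAITALL = 0x100  # used here only as a flag bit in the toy model


@dataclass
class Op:
    name: str
    requested: int    # bytes the caller asked for
    transferred: int  # bytes the operation actually moved
    flags: int = 0


def run_chain(ops):
    """Run ops as one linked chain; a short MSG_WAITALL transfer breaks the link."""
    results = []
    failed = False
    for op in ops:
        if failed:
            results.append((op.name, "canceled"))
            continue
        results.append((op.name, op.transferred))
        short = op.transferred < op.requested
        if short and (op.flags & MSG_WAITALL):
            failed = True  # models the req_set_fail_links() call
    return results


with_waitall = run_chain([
    Op("SENDMSG", requested=4096, transferred=1024, flags=MSG_WAITALL),
    Op("SPLICE", requested=4096, transferred=4096),
])
without_waitall = run_chain([
    Op("SENDMSG", requested=4096, transferred=1024),
    Op("SPLICE", requested=4096, transferred=4096),
])
```

In this model, the short send stops the chain only when MSG_WAITALL is set; without the flag, the following op still runs.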
      
      There might be applications around depending on the behavior
      that even short send[msg]()/recv[msg]() returns continue an
      IOSQE_IO_LINK chain.
      
      It's very unlikely that such applications pass in MSG_WAITALL,
      which is only defined in 'man 2 recvmsg', but not in 'man 2 sendmsg'.
      
      It's expected that the low level sock_sendmsg() call just ignores
      MSG_WAITALL, as MSG_ZEROCOPY is also ignored without explicitly set
      SO_ZEROCOPY.
      
      We also expect the caller to know about the implicit truncation to
      MAX_RW_COUNT, which we don't detect.
      
      cc: netdev@vger.kernel.org
      Link: https://lore.kernel.org/r/c4e1a4cc0d905314f4d5dc567e65a7b09621aab3.1615908477.git.metze@samba.org
      Signed-off-by: Stefan Metzmacher <metze@samba.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: ensure task is running before processing task_work · 00ddff43
      Jens Axboe authored
      Mark the current task as running if we need to run task_work from the
      io-wq threads as part of work handling. If that is the case, then return
      as such so that the caller can appropriately loop back and reset if it
      was part of a going-to-sleep flush.
      
      Fixes: 3bfe6106 ("io-wq: fork worker threads from original task")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • signal: don't allow STOP on PF_IO_WORKER threads · 4db4b1a0
      Eric W. Biederman authored
      
      
      Just like we don't allow normal signals to IO threads, don't deliver a
      STOP to a task that has PF_IO_WORKER set. The IO threads don't take
      signals in general, and have no means of flushing out a stop either.
      
      Longer term, we may want to look into allowing stop of these threads,
      as it relates to eg process freezing. For now, this prevents a spin
      issue if a SIGSTOP is delivered to the parent task.
      
      Reported-by: Stefan Metzmacher <metze@samba.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • signal: don't allow sending any signals to PF_IO_WORKER threads · 5be28c8f
      Jens Axboe authored
      
      
      They don't take signals individually, and even if they share signals with
      the parent task, don't allow them to be delivered through the worker
      thread. Linux does allow this kind of behavior for regular threads, but
      it's really a compatibility thing that we need not care about for the IO
      threads.
      
      Reported-by: Stefan Metzmacher <metze@samba.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Theodore Ts'o's avatar
      64395d95