  1. Apr 09, 2021
    • ext4: add proc files to monitor new structures · f68f4063
      Harshad Shirwadkar authored
      
      
      This patch adds a new file "mb_structs_summary" which allows us to see
      the summary of the new allocator structures added in this
      series. Here is sample output from the file:
      
      optimize_scan: 1
      max_free_order_lists:
              list_order_0_groups: 0
              list_order_1_groups: 0
              list_order_2_groups: 0
              list_order_3_groups: 0
              list_order_4_groups: 0
              list_order_5_groups: 0
              list_order_6_groups: 0
              list_order_7_groups: 0
              list_order_8_groups: 0
              list_order_9_groups: 0
              list_order_10_groups: 0
              list_order_11_groups: 0
              list_order_12_groups: 0
              list_order_13_groups: 40
      fragment_size_tree:
              tree_min: 16384
              tree_max: 32768
              tree_nodes: 40
      
      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20210401172129.189766-7-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: improve cr 0 / cr 1 group scanning · 196e402a
      Harshad Shirwadkar authored
      
      
      Instead of traversing through groups linearly, scan groups in specific
      orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
      largest free order >= the order of the request. So, with this patch,
      we maintain lists for each possible order and insert each group into a
      list based on the largest free order in its buddy bitmap. During cr 0
      allocation, we traverse these lists in the increasing order of largest
      free orders. This allows us to find a group with the best available cr
      0 match in constant time. If nothing can be found, we fall back to cr 1
      immediately.
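
As a rough illustration of the cr 0 scheme described above, the following Python sketch keeps one list of groups per largest free order and scans the lists from the requested order upward. The class and method names (`OrderLists`, `insert`, `find_group`) are hypothetical and are not taken from mballoc.c; the real code operates on ext4 group info structures and has to deal with locking, which this toy model ignores.

```python
class OrderLists:
    """Toy model: one bucket of group numbers per largest free order."""

    def __init__(self, num_orders):
        # lists[i] holds groups whose largest free extent has order i
        self.lists = [[] for _ in range(num_orders)]

    def insert(self, group, largest_free_order):
        self.lists[largest_free_order].append(group)

    def find_group(self, request_order):
        # Walk the lists in increasing order, starting at the request order.
        # The number of lists is fixed, so the walk is bounded by a constant.
        for order in range(request_order, len(self.lists)):
            if self.lists[order]:
                return self.lists[order][0]
        return None  # nothing suitable: the caller would fall back to cr 1


lists = OrderLists(num_orders=14)
lists.insert(group=7, largest_free_order=3)
lists.insert(group=2, largest_free_order=9)
lists.insert(group=5, largest_free_order=13)
```

Here a request of order 4 skips group 7, whose largest free order is only 3, and lands on group 2.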
      
      At CR1, the story is slightly different. We want to traverse in the
      order of increasing average fragment size. For CR1, we maintain an rb
      tree of groupinfos which is sorted by average fragment size. Instead
      of traversing linearly, at CR1, we traverse in the order of increasing
      average fragment size, starting at the most optimal group. This brings
      down cr 1 search complexity to log(num groups).
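
The cr 1 traversal can be pictured with a sorted sequence standing in for the rb tree. This Python sketch uses `bisect` on a sorted list rather than a real red-black tree, and the names (`FragmentIndex`, `groups_from`) are made up for illustration. It also assumes that "the most optimal group" means the first group whose average fragment size is at least the request length, which the commit message does not spell out.

```python
import bisect


class FragmentIndex:
    """Toy stand-in for the rb tree: (avg_fragment_size, group) kept sorted."""

    def __init__(self):
        self.entries = []

    def insert(self, group, avg_fragment_size):
        bisect.insort(self.entries, (avg_fragment_size, group))

    def groups_from(self, request_len):
        # Binary search for the first entry with avg size >= request_len
        # (group numbers are non-negative, so -1 sorts before all of them),
        # then yield groups in increasing average fragment size.
        start = bisect.bisect_left(self.entries, (request_len, -1))
        for _, group in self.entries[start:]:
            yield group


index = FragmentIndex()
index.insert(group=3, avg_fragment_size=16)
index.insert(group=8, avg_fragment_size=512)
index.insert(group=1, avg_fragment_size=64)
```

The binary search locates the starting point in logarithmic time; the walk from there visits candidates in increasing average fragment size.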
      
      For cr >= 2, we just perform the linear search as before. Also, in
      case of lock contention, we intermittently fall back to linear search
      even in CR 0 and CR 1 cases. This allows us to proceed during the
      allocation path even in case of high contention.
      
      There is an opportunity to do optimization at CR2 too. That's because
      at CR2 we only consider groups where bb_free counter (number of free
      blocks) is greater than the request extent size. That's left as future
      work.
      
      All the changes introduced in this patch are protected under a new
      mount option "mb_optimize_scan".
      
      With this patchset, the following experiment was performed:
      
      Created a highly fragmented disk of size 65TB. The disk had no
      contiguous 2M regions. The following command was run 3 times
      consecutively:
      
      time dd if=/dev/urandom of=file bs=2M count=10
      
      Here are the results with and without cr 0/1 optimizations introduced
      in this patch:
      
      |---------+------------------------------+---------------------------|
      |         | Without CR 0/1 Optimizations | With CR 0/1 Optimizations |
      |---------+------------------------------+---------------------------|
      | 1st run | 5m1.871s                     | 2m47.642s                 |
      | 2nd run | 2m28.390s                    | 0m0.611s                  |
      | 3rd run | 2m26.530s                    | 0m1.255s                  |
      |---------+------------------------------+---------------------------|
      
      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Link: https://lore.kernel.org/r/20210401172129.189766-6-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: add MB_NUM_ORDERS macro · 4b68f6df
      Harshad Shirwadkar authored
      
      
      A few arrays in mballoc.c use the total number of valid orders as
      their size. Currently, this value is set as "sb->s_blocksize_bits +
      2". This makes code harder to read. So, instead add a new macro
      MB_NUM_ORDERS(sb) to make the code more readable.
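
A worked instance of the expression above, written as a Python sketch rather than the actual C macro: the function name `mb_num_orders` is illustrative, and it takes the block size bits directly instead of a superblock pointer.

```python
def mb_num_orders(blocksize_bits):
    """Number of valid buddy orders: s_blocksize_bits + 2."""
    return blocksize_bits + 2


# A 4096-byte block size has 12 block size bits.
orders_4k = mb_num_orders(12)
# A 1024-byte block size has 10 block size bits.
orders_1k = mb_num_orders(10)
```

With 12 block size bits this yields 14 orders, that is, orders 0 through 13.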
      
      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20210401172129.189766-5-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: add mballoc stats proc file · a6c75eaf
      Harshad Shirwadkar authored
      
      
      Add new stats for measuring the performance of mballoc. This patch is
      forked from Artem Blagodarenko's work that can be found here:
      
      https://github.com/lustre/lustre-release/blob/master/ldiskfs/kernel_patches/patches/rhel8/ext4-simple-blockalloc.patch
      
      This patch reorganizes the stats by cr level. This is what the output
      looks like:
      
      mballoc:
      	reqs: 0
      	success: 0
      	groups_scanned: 0
      	cr0_stats:
      		hits: 0
      		groups_considered: 0
      		useless_loops: 0
      		bad_suggestions: 0
      	cr1_stats:
      		hits: 0
      		groups_considered: 0
      		useless_loops: 0
      		bad_suggestions: 0
      	cr2_stats:
      		hits: 0
      		groups_considered: 0
      		useless_loops: 0
      	cr3_stats:
      		hits: 0
      		groups_considered: 0
      		useless_loops: 0
      	extents_scanned: 0
      		goal_hits: 0
      		2^n_hits: 0
      		breaks: 0
      		lost: 0
      	buddies_generated: 0/40
      	buddies_time_used: 0
      	preallocated: 0
      	discarded: 0
      
      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20210401172129.189766-4-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: add ability to return parsed options from parse_options · b237e304
      Harshad Shirwadkar authored
      
      
      Before this patch, the function parse_options() was returning
      journal_devnum and journal_ioprio variables to the caller. This patch
      generalizes that interface to allow parse_options() to return any
      parsed options to the caller. In this patch series, it gets
      used to capture the value of the "mb_optimize_scan=%u" mount option.
      
      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20210401172129.189766-3-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: drop s_mb_bal_lock and convert protected fields to atomic · 67d25186
      Harshad Shirwadkar authored
      
      
      s_mb_buddies_generated gets used later in this patch series to
      determine if the cr 0 and cr 1 optimizations should be performed or
      not. Currently, s_mb_buddies_generated is protected under a
      spin_lock. In the allocation path, it is better if we don't depend on
      the lock and instead read the value atomically. In order to do that,
      we drop s_bal_lock altogether and convert the only two fields it
      protects, s_mb_buddies_generated and s_mb_generation_time, to atomic
      type.
      
      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Reviewed-by: Ritesh Harjani <ritesh.list@gmail.com>
      Link: https://lore.kernel.org/r/20210401172129.189766-2-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix check to prevent false positive report of incorrect used inodes · a149d2a5
      Zhang Yi authored
      Commit 50122847 ("ext4: fix check to prevent initializing reserved
      inodes") checks block group zero and prevents initializing reserved
      inodes. But in some special cases, the reserved inodes may not all belong
      to group zero; some may fall into the second group if we format the
      filesystem as below.
      
        mkfs.ext4 -b 4096 -g 8192 -N 1024 -I 4096 /dev/sda
      
      So, it will end up triggering a false positive report of a corrupted
      file system. This patch fixes it by skipping the reserved inode check
      if no free inode blocks will be zeroed.
      
      Cc: stable@kernel.org
      Fixes: 50122847 ("ext4: fix check to prevent initializing reserved inodes")
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20210331121516.2243099-1-yi.zhang@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  2. Apr 06, 2021
    • jbd2: avoid -Wempty-body warnings · d5564351
      Arnd Bergmann authored
      
      
      Building with 'make W=1' shows a harmless -Wempty-body warning:
      
      fs/jbd2/recovery.c: In function 'fc_do_one_pass':
      fs/jbd2/recovery.c:267:75: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body]
        267 |                 jbd_debug(3, "Fast commit replay failed, err = %d\n", err);
            |                                                                           ^
      
      Change the empty dprintk() macros to no_printk(), which avoids this
      warning and adds format string checking.
      
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20210322102152.95684-1-arnd@kernel.org
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: optimize match for casefolded encrypted dirs · 1ae98e29
      Daniel Rosenberg authored
      
      
      Matching names in casefolded encrypted directories requires
      decrypting entries to confirm case since we are case preserving. We can
      avoid needing to decrypt if our hash values don't match.
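
The shortcut described above can be sketched as follows. This is a Python model under stated assumptions, not the ext4 code: `fold`, `name_hash`, `encrypt`, and `decrypt` are hypothetical stand-ins (Python casefolding, a SHA-256 prefix, and byte reversal) chosen only so the example runs. The real code uses the hash stored with the directory entry and fscrypt for decryption.

```python
import hashlib

decrypt_calls = 0


def fold(name):
    # Stand-in for Unicode casefolding.
    return name.casefold()


def name_hash(name):
    # Stand-in for the stored hash of the casefolded plaintext name.
    return hashlib.sha256(fold(name).encode()).digest()[:8]


def encrypt(name):
    # Toy reversible "encryption": reverse the bytes.
    return name.encode()[::-1]


def decrypt(blob):
    global decrypt_calls
    decrypt_calls += 1
    return blob[::-1].decode()


def make_entry(name):
    # A directory entry keeps the case-preserved encrypted name and the hash.
    return {"enc_name": encrypt(name), "hash": name_hash(name)}


def match(entry, lookup_name):
    # Cheap check first: differing hashes cannot be the same folded name.
    if entry["hash"] != name_hash(lookup_name):
        return False
    # Hashes agree: decrypt to confirm the casefolded names really match.
    return fold(decrypt(entry["enc_name"])) == fold(lookup_name)


entries = [make_entry("README"), make_entry("Makefile"), make_entry("src")]
found = [match(e, "makefile") for e in entries]
```

In this toy run only the entry whose hash matches is ever decrypted; the other two are rejected by the hash comparison alone.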
      
      Signed-off-by: Daniel Rosenberg <drosen@google.com>
      Link: https://lore.kernel.org/r/20210319073414.1381041-3-drosen@google.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: handle casefolding with encryption · 471fbbea
      Daniel Rosenberg authored
      
      
      This adds support for encryption with casefolding.
      
      Since the name on disk is case preserving, and also encrypted, we can no
      longer just recompute the hash on the fly. Additionally, to avoid
      leaking extra information from the hash of the unencrypted name, we use
      siphash via an fscrypt v2 policy.
      
      The hash is stored at the end of the directory entry for all entries
      inside of an encrypted and casefolded directory apart from those that
      deal with '.' and '..'. This way, the change is backwards compatible
      with existing ext4 filesystems.
      
      [ Changed to advertise this feature via the file:
        /sys/fs/ext4/features/encrypted_casefold -- TYT ]
      
      Signed-off-by: Daniel Rosenberg <drosen@google.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Link: https://lore.kernel.org/r/20210319073414.1381041-2-drosen@google.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
  3. Apr 03, 2021
  4. Apr 02, 2021
  5. Mar 25, 2021
  6. Mar 22, 2021
    • Linux 5.12-rc4 · 0d02ec6b
      Linus Torvalds authored
      v5.12-rc4
    • Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · d7f5f1bd
      Linus Torvalds authored
      Pull ext4 fixes from Ted Ts'o:
       "Miscellaneous ext4 bug fixes for v5.12"
      
      * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
        ext4: initialize ret to suppress smatch warning
        ext4: stop inode update before return
        ext4: fix rename whiteout with fast commit
        ext4: fix timer use-after-free on failed mount
        ext4: fix potential error in ext4_do_update_inode
        ext4: do not try to set xattr into ea_inode if value is empty
        ext4: do not iput inode under running transaction in ext4_rename()
        ext4: find old entry again if failed to rename whiteout
        ext4: fix error handling in ext4_end_enable_verity()
        ext4: fix bh ref count on error paths
        fs/ext4: fix integer overflow in s_log_groups_per_flex
        ext4: add reclaim checks to xattr code
        ext4: shrink race window in ext4_should_retry_alloc()
    • Merge tag 'io_uring-5.12-2021-03-21' of git://git.kernel.dk/linux-block · 2c41fab1
      Linus Torvalds authored
      Pull io_uring followup fixes from Jens Axboe:
      
       - The SIGSTOP change from Eric, so we properly ignore that for
         PF_IO_WORKER threads.
      
       - Disallow sending signals to PF_IO_WORKER threads in general, we're
         not interested in having them funnel back to the io_uring owning
         task.
      
       - Stable fix from Stefan, ensuring we properly break links for short
         send/sendmsg recv/recvmsg if MSG_WAITALL is set.
      
       - Catch and loop when needing to run task_work before a PF_IO_WORKER
         thread goes to sleep.
      
      * tag 'io_uring-5.12-2021-03-21' of git://git.kernel.dk/linux-block:
        io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL
        io-wq: ensure task is running before processing task_work
        signal: don't allow STOP on PF_IO_WORKER threads
        signal: don't allow sending any signals to PF_IO_WORKER threads
    • Merge tag 'staging-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 1d4345eb
      Linus Torvalds authored
      Pull staging and IIO driver fixes from Greg KH:
       "Some small staging and IIO driver fixes:
      
         - MAINTAINERS changes for the move of the staging mailing list
      
         - comedi driver fixes to get request_irq() to work correctly
      
         - counter driver fixes for reported issues with iio devices
      
         - tiny iio driver fixes for reported issues.
      
        All of these have been in linux-next with no reported problems"
      
      * tag 'staging-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        staging: vt665x: fix alignment constraints
        staging: comedi: cb_pcidas64: fix request_irq() warn
        staging: comedi: cb_pcidas: fix request_irq() warn
        MAINTAINERS: move the staging subsystem to lists.linux.dev
        MAINTAINERS: move some real subsystems off of the staging mailing list
        iio: gyro: mpu3050: Fix error handling in mpu3050_trigger_handler
        iio: hid-sensor-temperature: Fix issues of timestamp channel
        iio: hid-sensor-humidity: Fix alignment issue of timestamp channel
        counter: stm32-timer-cnt: fix ceiling miss-alignment with reload register
        counter: stm32-timer-cnt: fix ceiling write max value
        counter: stm32-timer-cnt: Report count function when SLAVE_MODE_DISABLED
        iio: adc: ab8500-gpadc: Fix off by 10 to 3
        iio:adc:stm32-adc: Add HAS_IOMEM dependency
        iio: adis16400: Fix an error code in adis16400_initial_setup()
        iio: adc: adi-axi-adc: add proper Kconfig dependencies
        iio: adc: ad7949: fix wrong ADC result due to incorrect bit mask
        iio: hid-sensor-prox: Fix scale not correct issue
        iio:adc:qcom-spmi-vadc: add default scale to LR_MUX2_BAT_ID channel
    • Merge tag 'usb-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · 3001c355
      Linus Torvalds authored
      Pull USB and Thunderbolt driver fixes from Greg KH:
       "Here are some small Thunderbolt and USB driver fixes for some reported
        issues:
      
         - thunderbolt fixes for minor problems
      
         - typec fixes for power issues
      
         - usb-storage quirk addition
      
         - usbip bugfix
      
         - dwc3 bugfix when stopping transfers
      
         - cdnsp bugfix for isoc transfers
      
         - gadget use-after-free fix
      
        All have been in linux-next this week with no reported issues"
      
      * tag 'usb-5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        usb: typec: tcpm: Skip sink_cap query only when VDM sm is busy
        usb: dwc3: gadget: Prevent EP queuing while stopping transfers
        usb: typec: tcpm: Invoke power_supply_changed for tcpm-source-psy-
        usb: typec: Remove vdo[3] part of tps6598x_rx_identity_reg struct
        usb-storage: Add quirk to defeat Kindle's automatic unload
        usb: gadget: configfs: Fix KASAN use-after-free
        usbip: Fix incorrect double assignment to udc->ud.tcp_rx
        usb: cdnsp: Fixes incorrect value in ISOC TRB
        thunderbolt: Increase runtime PM reference count on DP tunnel discovery
        thunderbolt: Initialize HopID IDAs in tb_switch_alloc()
    • Merge tag 'irq-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5ee96fa9
      Linus Torvalds authored
      Pull irq fix from Ingo Molnar:
       "A change to robustify force-threaded IRQ handlers to always disable
        interrupts, plus a DocBook fix.
      
        The force-threaded IRQ handler change has been accelerated from the
        normal schedule of such a change to keep the bad pattern/workaround of
        spin_lock_irqsave() in handlers or IRQF_NOTHREAD as a kludge from
        spreading"
      
      * tag 'irq-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        genirq: Disable interrupts for force threaded handlers
        genirq/irq_sim: Fix typos in kernel doc (fnode -> fwnode)
    • Merge tag 'perf-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 1c74516c
      Linus Torvalds authored
      Pull perf fixes from Ingo Molnar:
       "Boundary condition fixes for bugs unearthed by the perf fuzzer"
      
      * tag 'perf-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        perf/x86/intel: Fix unchecked MSR access error caused by VLBR_EVENT
        perf/x86/intel: Fix a crash caused by zero PEBS status
    • Merge tag 'locking-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5ba33b48
      Linus Torvalds authored
      Pull locking fixes from Ingo Molnar:
      
       - Get static calls & modules right. Hopefully.
      
       - WW mutex fixes
      
      * tag 'locking-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        static_call: Fix static_call_update() sanity check
        static_call: Align static_call_is_init() patching condition
        static_call: Fix static_call_set_init()
        locking/ww_mutex: Fix acquire/release imbalance in ww_acquire_init()/ww_acquire_fini()
        locking/ww_mutex: Simplify use_ww_ctx & ww_ctx handling
    • Merge tag 'efi-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 92ed88cb
      Linus Torvalds authored
      Pull EFI fixes from Ingo Molnar:
      
       - another missing RT_PROP table related fix, to ensure that the
         efivarfs pseudo filesystem fails gracefully if variable services
         are unsupported
      
       - use the correct alignment for literal EFI GUIDs
      
       - fix a use after unmap issue in the memreserve code
      
      * tag 'efi-urgent-2021-03-21' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        efi: use 32-bit alignment for efi_guid_t literals
        firmware/efi: Fix a use after bug in efi_mem_reserve_persistent
        efivars: respect EFI_UNSUPPORTED return from firmware
    • Merge tag 'x86_urgent_for_v5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5e3ddf96
      Linus Torvalds authored
      Pull x86 fixes from Borislav Petkov:
       "The freshest pile of shiny x86 fixes for 5.12:
      
         - Add the arch-specific mapping between physical and logical CPUs to
           fix devicetree-node lookups
      
         - Restore the IRQ2 ignore logic
      
         - Fix get_nr_restart_syscall() to return the correct restart syscall
           number. Split in a 4-patches set to avoid kABI breakage when
           backporting to dead kernels"
      
      * tag 'x86_urgent_for_v5.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/apic/of: Fix CPU devicetree-node lookups
        x86/ioapic: Ignore IRQ2 again
        x86: Introduce restart_block->arch_data to remove TS_COMPAT_RESTART
        x86: Introduce TS_COMPAT_RESTART to fix get_nr_restart_syscall()
        x86: Move TS_COMPAT back to asm/thread_info.h
        kernel, fs: Introduce and use set_restart_fn() and arch_set_restart_data()
    • Merge tag 'powerpc-5.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · b35660a7
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
      
       - Fix a possible stack corruption and subsequent DLPAR failure in the
         rpadlpar_io PCI hotplug driver
      
       - Two build fixes for uncommon configurations
      
      Thanks to Christophe Leroy and Tyrel Datwyler.
      
      * tag 'powerpc-5.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        PCI: rpadlpar: Fix potential drc_name corruption in store functions
        powerpc: Force inlining of cpu_has_feature() to avoid build failure
        powerpc/vdso32: Add missing _restgpr_31_x to fix build failure
  7. Mar 21, 2021
    • io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL · 0031275d
      Stefan Metzmacher authored
      
      
      Without that it's not safe to use them in a linked combination with
      others.
      
      Now combinations like IORING_OP_SENDMSG followed by IORING_OP_SPLICE
      should be possible.
      
      We already handle short reads and writes for the following opcodes:
      
      - IORING_OP_READV
      - IORING_OP_READ_FIXED
      - IORING_OP_READ
      - IORING_OP_WRITEV
      - IORING_OP_WRITE_FIXED
      - IORING_OP_WRITE
      - IORING_OP_SPLICE
      - IORING_OP_TEE
      
      Now we have it for these as well:
      
      - IORING_OP_SENDMSG
      - IORING_OP_SEND
      - IORING_OP_RECVMSG
      - IORING_OP_RECV
      
      For IORING_OP_RECVMSG we also check for the MSG_TRUNC and MSG_CTRUNC
      flags in order to call req_set_fail_links().
      
      There might be applications around depending on the behavior
      that even short send[msg]()/recv[msg]() returns continue an
      IOSQE_IO_LINK chain.
      
      It's very unlikely that such applications pass in MSG_WAITALL,
      which is only defined in 'man 2 recvmsg', but not in 'man 2 sendmsg'.
      
      It's expected that the low level sock_sendmsg() call just ignores
      MSG_WAITALL, as MSG_ZEROCOPY is also ignored without explicitly set
      SO_ZEROCOPY.
      
      We also expect the caller to know about the implicit truncation to
      MAX_RW_COUNT, which we don't detect.
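
The rule described in this message can be modelled as a small decision function. This is a Python sketch of the logic as stated above, not the io_uring source: `should_fail_link` and its parameters are hypothetical names, and it assumes that without MSG_WAITALL only negative results break the chain. The flag constants are the Linux values.

```python
# Linux values of the socket flags used below.
MSG_CTRUNC = 0x08
MSG_TRUNC = 0x20
MSG_WAITALL = 0x100


def should_fail_link(ret, requested, flags, out_msg_flags=0, is_recvmsg=False):
    """Return True if the rest of the IOSQE_IO_LINK chain should be failed."""
    # Assumption: without MSG_WAITALL only errors (ret < 0) break the chain.
    min_ret = requested if flags & MSG_WAITALL else 0
    if ret < min_ret:
        return True
    # For recvmsg, truncated data or control messages also count as short.
    if is_recvmsg and flags & MSG_WAITALL:
        return bool(out_msg_flags & (MSG_TRUNC | MSG_CTRUNC))
    return False
```

For example, a send that transfers 100 of 4096 requested bytes fails the link only when MSG_WAITALL was passed, so callers that never set the flag keep the previous behavior.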
      
      cc: netdev@vger.kernel.org
      Link: https://lore.kernel.org/r/c4e1a4cc0d905314f4d5dc567e65a7b09621aab3.1615908477.git.metze@samba.org
      Signed-off-by: Stefan Metzmacher <metze@samba.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io-wq: ensure task is running before processing task_work · 00ddff43
      Jens Axboe authored
      Mark the current task as running if we need to run task_work from the
      io-wq threads as part of work handling. If that is the case, then return
      as such so that the caller can appropriately loop back and reset if it
      was part of a going-to-sleep flush.
      
      Fixes: 3bfe6106 ("io-wq: fork worker threads from original task")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • signal: don't allow STOP on PF_IO_WORKER threads · 4db4b1a0
      Eric W. Biederman authored
      
      
      Just like we don't allow normal signals to IO threads, don't deliver a
      STOP to a task that has PF_IO_WORKER set. The IO threads don't take
      signals in general, and have no means of flushing out a stop either.
      
      Longer term, we may want to look into allowing stop of these threads,
      as it relates to eg process freezing. For now, this prevents a spin
      issue if a SIGSTOP is delivered to the parent task.
      
      Reported-by: Stefan Metzmacher <metze@samba.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • signal: don't allow sending any signals to PF_IO_WORKER threads · 5be28c8f
      Jens Axboe authored
      
      
      They don't take signals individually, and even if they share signals with
      the parent task, don't allow them to be delivered through the worker
      thread. Linux does allow this kind of behavior for regular threads, but
      it's really a compatibility thing that we need not care about for the IO
      threads.
      
      Reported-by: Stefan Metzmacher <metze@samba.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Theodore Ts'o's avatar
      64395d95
    • ext4: stop inode update before return · 512c15ef
      Pan Bian authored
      
      
      The inode update should be stopped before returning the error code.
      
      Signed-off-by: Pan Bian <bianpan2016@163.com>
      Link: https://lore.kernel.org/r/20210117085732.93788-1-bianpan2016@163.com
      Fixes: 8016e29f ("ext4: fast commit recovery path")
      Cc: stable@kernel.org
      Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix rename whiteout with fast commit · 8210bb29
      Harshad Shirwadkar authored
      This patch adds rename whiteout support in fast commits. Note that the
      whiteout object that gets created is actually a char device, which
      implies that the function ext4_inode_journal_mode(struct inode *inode)
      would return "JOURNAL_DATA" for this inode. This has a consequence in
      fast commit code that it will make creation of the whiteout object a
      fast-commit ineligible behavior and thus will fall back to full
      commits. With this patch, this can be observed by running fast commits
      with rename whiteout and seeing the stats generated by ext4_fc_stats
      tracepoint as follows:
      
      ext4_fc_stats: dev 254:32 fc ineligible reasons:
      XATTR:0, CROSS_RENAME:0, JOURNAL_FLAG_CHANGE:0, NO_MEM:0, SWAP_BOOT:0,
      RESIZE:0, RENAME_DIR:0, FALLOC_RANGE:0, INODE_JOURNAL_DATA:16;
      num_commits:6, ineligible: 6, numblks: 3
      
      So in short, this patch guarantees that in case of rename whiteout, we
      fall back to full commits.
      
      Amir mentioned that instead of creating a new whiteout object for
      every rename, we can create a static whiteout object with irrelevant
      nlink. That would keep fast commits from falling back to full
      commits. But until this happens, this patch will ensure correctness by
      falling back to full commits.
      
      Fixes: 8016e29f ("ext4: fast commit recovery path")
      Cc: stable@kernel.org
      Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Link: https://lore.kernel.org/r/20210316221921.1124955-1-harshadshirwadkar@gmail.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix timer use-after-free on failed mount · 2a4ae3bc
      Jan Kara authored
      
      
      When a filesystem mount fails because of a corrupted filesystem, we first
      cancel the s_err_report timer, which reports fs errors every day, and
      only then flush s_error_work. However, s_error_work may report another
      fs error and re-arm the timer, thus resulting in a timer use-after-free.
      Fix the problem by first flushing the work and only after that canceling
      the s_err_report timer.
      
      Reported-by: <syzbot+628472a2aac693ab0fcd@syzkaller.appspotmail.com>
      Fixes: 2d01ddc8 ("ext4: save error info to sb through journal if available")
      CC: stable@vger.kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20210315165906.2175-1-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: fix potential error in ext4_do_update_inode · 7d8bd3c7
      Shijie Luo authored
      
      
      If set_large_file = 1 and errors occur in ext4_handle_dirty_metadata(),
      the error code will be overridden. Go to out_brelse to avoid this
      situation.
      
      Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
      Link: https://lore.kernel.org/r/20210312065051.36314-1-luoshijie1@huawei.com
      Cc: stable@kernel.org
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
    • ext4: do not try to set xattr into ea_inode if value is empty · 6b224899
      zhangyi (F) authored
      
      
      Syzbot reported a warning that ext4 may create an empty ea_inode when
      an empty extended attribute is set on a file in a file system that has
      no free blocks left.
      
        WARNING: CPU: 6 PID: 10667 at fs/ext4/xattr.c:1640 ext4_xattr_set_entry+0x10f8/0x1114 fs/ext4/xattr.c:1640
        ...
        Call trace:
         ext4_xattr_set_entry+0x10f8/0x1114 fs/ext4/xattr.c:1640
         ext4_xattr_block_set+0x1d0/0x1b1c fs/ext4/xattr.c:1942
         ext4_xattr_set_handle+0x8a0/0xf1c fs/ext4/xattr.c:2390
         ext4_xattr_set+0x120/0x1f0 fs/ext4/xattr.c:2491
         ext4_xattr_trusted_set+0x48/0x5c fs/ext4/xattr_trusted.c:37
         __vfs_setxattr+0x208/0x23c fs/xattr.c:177
        ...
      
      Currently, ext4 tries to store an extended attribute in an external
      inode if ext4_xattr_block_set() returns -ENOSPC, but when storing an
      empty extended attribute, storing the xattr entry in the extended
      attribute block is enough. A simple reproducer follows.
      
        fallocate test.img -l 1M
        mkfs.ext4 -F -b 2048 -O ea_inode test.img
        mount test.img /mnt
        dd if=/dev/zero of=/mnt/foo bs=2048 count=500
        setfattr -n "user.test" /mnt/foo
      
      Reported-by: <syzbot+98b881fdd8ebf45ab4ae@syzkaller.appspotmail.com>
      Fixes: 9c6e7853 ("ext4: reserve space for xattr entries/names")
      Cc: stable@kernel.org
      Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
      Link: https://lore.kernel.org/r/20210305120508.298465-1-yi.zhang@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      6b224899
    • zhangyi (F)'s avatar
      ext4: do not iput inode under running transaction in ext4_rename() · 5dccdc5a
      zhangyi (F) authored
      In ext4_rename(), when RENAME_WHITEOUT fails to add the new entry to the
      directory, it ends up dropping the newly created whiteout inode under
      the running transaction. After commit 9b88f9fb ("ext4: Do not iput inode
      under running transaction"), we follow the assumption that evict() does
      not get called from a transaction context, but ext4_rename() breaks this
      assumption. Although it is not a real problem, it is better to obey it,
      so this patch adds the inode to the orphan list and stops the
      transaction before the final iput().
      
      Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
      Link: https://lore.kernel.org/r/20210303131703.330415-2-yi.zhang@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      5dccdc5a
    • zhangyi (F)'s avatar
      ext4: find old entry again if failed to rename whiteout · b7ff91fd
      zhangyi (F) authored
      If we fail to add the new entry on a whiteout rename, we cannot reset
      the old->de entry directly, because old->de could have moved from under
      us while the directory was being made indexed. So find the old entry
      again before resetting it; otherwise the filesystem may be corrupted as
      below.
      
        /dev/sda: Entry '00000001' in ??? (12) has deleted/unused inode 15. CLEARED.
        /dev/sda: Unattached inode 75
        /dev/sda: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
      
      Fixes: 6b4b8e6b ("ext4: fix bug for rename with RENAME_WHITEOUT")
      Cc: stable@vger.kernel.org
      Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
      Link: https://lore.kernel.org/r/20210303131703.330415-1-yi.zhang@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      b7ff91fd
    • Thomas Gleixner's avatar
      genirq: Disable interrupts for force threaded handlers · 81e2073c
      Thomas Gleixner authored
      With interrupt force threading, all device interrupt handlers are invoked
      from kernel threads. Contrary to hard interrupt context, the invocation
      only disables bottom halves, but not interrupts. This was an oversight
      back then because any code like this will have an issue:
      
      thread(irq_A)
        irq_handler(A)
          spin_lock(&foo->lock);
      
      interrupt(irq_B)
        irq_handler(B)
          spin_lock(&foo->lock);
      
      This has been triggered with networking (NAPI vs. hrtimers) and console
      drivers where printk() happens from an interrupt which interrupted the
      force threaded handler.
      
      People have noticed this and started to change the spin_lock() in the
      handler to spin_lock_irqsave(), which affects performance, or to add
      IRQF_NO_THREAD to the interrupt request, which in turn breaks RT.
      
      Fix the root cause and not the symptom and disable interrupts before
      invoking the force threaded handler which preserves the regular semantics
      and the usefulness of the interrupt force threading as a general debugging
      tool.
      
      For non-RT kernels this does not change much, except that during the
      execution of the threaded handler interrupts are delayed until the
      handler returns. With respect to scheduling and softirq processing there
      is no difference.
      
      For RT kernels there is no issue.
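
The deadlock and its fix can be modelled in user space on a single simulated CPU. This is a rough sketch, not kernel code: `struct fake_cpu`, `hard_irq_b()` and `threaded_handler_deadlocks()` are hypothetical names, and the model assumes interrupt B is raised exactly while handler A holds foo->lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-CPU model; none of these names are kernel APIs. */
struct fake_cpu {
    bool irqs_enabled;
    bool lock_held;     /* models foo->lock being held on this CPU */
};

/* Models irq_handler(B) arriving in hard interrupt context. Returns true
 * if it would spin forever on a lock that its own CPU already holds. */
static bool hard_irq_b(const struct fake_cpu *cpu)
{
    if (!cpu->irqs_enabled)
        return false;       /* interrupt is delayed, handler B does not run */
    return cpu->lock_held;  /* handler B spins on foo->lock */
}

/* Models the force threaded irq_handler(A), with interrupt B raised while
 * A holds the lock. Returns true if the scenario deadlocks. */
bool threaded_handler_deadlocks(bool disable_irqs_first)
{
    struct fake_cpu cpu = { .irqs_enabled = true, .lock_held = false };
    bool deadlock;

    if (disable_irqs_first)
        cpu.irqs_enabled = false;   /* behaviour introduced by the patch */

    cpu.lock_held = true;           /* spin_lock(&foo->lock) in handler A */
    deadlock = hard_irq_b(&cpu);    /* interrupt B is raised here */
    cpu.lock_held = false;          /* handler A releases the lock */

    cpu.irqs_enabled = true;        /* a delayed interrupt B can run now */
    return deadlock;
}

/* Usage: with interrupts left enabled the model deadlocks; with interrupts
 * disabled around the handler it does not. */
void demo_force_threaded(void)
{
    assert(threaded_handler_deadlocks(false));
    assert(!threaded_handler_deadlocks(true));
}
```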
      
      Fixes: 8d32a307 ("genirq: Provide forced interrupt threading")
      Reported-by: Johan Hovold <johan@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Johan Hovold <johan@kernel.org>
      Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: https://lore.kernel.org/r/20210317143859.513307808@linutronix.de
      81e2073c