  1. Jul 08, 2023
    • io_uring: Use io_schedule* in cqring wait · 8a796565
      Andres Freund authored
      
      
      I observed poor performance of io_uring compared to synchronous IO. That
      turns out to be caused by deeper CPU idle states entered with io_uring,
      due to io_uring using plain schedule(), whereas synchronous IO uses
      io_schedule().
      
      The losses due to this are substantial. On my cascade lake workstation,
      t/io_uring from the fio repository e.g. yields regressions between 20%
      and 40% with the following command:
      ./t/io_uring -r 5 -X0 -d 1 -s 1 -c 1 -p 0 -S$use_sync -R 0 /mnt/t2/fio/write.0.0
      
      This is repeatable with different filesystems, using raw block devices
      and using different block devices.
      
      Use io_schedule_prepare() / io_schedule_finish() in
      io_cqring_wait_schedule() to address the difference.
      
      After that, io_uring is on par with or surpasses synchronous IO (using
      registered files etc. makes it reliably win, but that is arguably a
      less fair comparison).
      
      There are other calls to schedule() in io_uring/, but none immediately
      jump out to be similarly situated, so I did not touch them. Similarly,
      it's possible that mutex_lock_io() should be used, but it's not clear if
      there are cases where that matters.
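
A minimal user-space sketch of the pattern this commit describes, not the kernel code itself: the wait path brackets its sleep with a prepare/finish pair so that the task is accounted as waiting on IO while it sleeps. `struct mock_task` and every `mock_`-prefixed or `wait_`-prefixed name below is a hypothetical stand-in. The sketch assumes that io_schedule_prepare() returns the previous in_iowait value as a token and that io_schedule_finish() restores it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's task state; only the iowait
 * flag and two counters used for illustration are modelled. */
struct mock_task {
    bool in_iowait;
    int iowait_sleeps; /* sleeps accounted as IO wait */
    int plain_sleeps;  /* sleeps not accounted as IO wait */
};

/* Assumed shape of io_schedule_prepare(): save the old flag, set iowait. */
static int mock_io_schedule_prepare(struct mock_task *t)
{
    int old = t->in_iowait;
    t->in_iowait = true;
    return old;
}

/* Assumed shape of io_schedule_finish(): restore the saved flag. */
static void mock_io_schedule_finish(struct mock_task *t, int token)
{
    assert(token == 0 || token == 1);
    t->in_iowait = token;
}

/* Stand-in for schedule(): records how the sleep would be accounted. */
static void mock_schedule(struct mock_task *t)
{
    if (t->in_iowait)
        t->iowait_sleeps++;
    else
        t->plain_sleeps++;
}

/* Before the change: a plain schedule(), so no iowait accounting. */
static void wait_before(struct mock_task *t)
{
    mock_schedule(t);
}

/* After the change: the sleep is bracketed by prepare/finish. */
static void wait_after(struct mock_task *t)
{
    int token = mock_io_schedule_prepare(t);

    mock_schedule(t);
    mock_io_schedule_finish(t, token);
}
```

Calling `wait_before()` on a zero-initialised task counts one plain sleep, while `wait_after()` counts one iowait sleep and leaves `in_iowait` as it found it. Saving and restoring the token, rather than clearing the flag unconditionally, keeps the pair safe to use from a context that is already marked as iowait.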
      
      Cc: stable@vger.kernel.org # 5.10+
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: io-uring@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Andres Freund <andres@anarazel.de>
      Link: https://lore.kernel.org/r/20230707162007.194068-1-andres@anarazel.de
      
      
      [axboe: minor style fixup]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. Jun 29, 2023
    • io_uring: flush offloaded and delayed task_work on exit · dfbe5561
      Jens Axboe authored
      
      
      io_uring offloads task_work for cancelation purposes when the task is
      exiting. This is conceptually fine, but we should be nicer and actually
      wait for that work to complete before returning.
      
      Add an argument to io_fallback_tw() telling it to flush the deferred
      work when it's all queued up, and have it flush a ctx behind whenever
      the ctx changes.
      
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. Jun 28, 2023
  4. Jun 27, 2023
    • Merge tag 'tag-chrome-platform-for-v6.5' of... · 1ef6663a
      Linus Torvalds authored
      Merge tag 'tag-chrome-platform-for-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux
      
      Pull chrome platform updates from Tzung-Bi Shih:
       "Improvements:
      
         - Support Pin Assignment D in getting mux state
      
         - Emit an uevent when EC panics so that userland programs get chance
           to capture EC coredumps (LPC interface only)
      
         - Send EC_CMD_HOST_SLEEP_EVENT to EC at the very beginning/end of
           system suspend/resume so that EC can watch the duration more
           accurately (LPC interface only)
      
        Misc:
      
         - Switch back from I2C .probe_new() to .probe()
      
         - Use %*ph for printing hexdump of small buffers"
      
      * tag 'tag-chrome-platform-for-v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/chrome-platform/linux:
        platform/chrome: cros_ec_spi: Use %*ph for printing hexdump of a small buffer
        platform/chrome: Switch i2c drivers back to use .probe()
        platform/chrome: cros_ec_lpc: Move host command to prepare/complete
        platform/chrome: cros_ec: Report EC panic as uevent
        platform/chrome: cros_typec_switch: Add Pin D support
    • Merge tag 'thermal-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 8d7868c4
      Linus Torvalds authored
      Pull thermal control updates from Rafael Wysocki:
       "These extend the int340x thermal driver, add thermal DT bindings for
        some Qcom platforms, add DT bindings and support for Armada AP807 and
        MSM8909, allow selecting the bang-bang thermal governor as the default
        one, address issues in several thermal drivers for ARM platforms and
        clean up code.
      
        Specifics:
      
         - Add new IOCTLs to the int340x thermal driver to allow user space to
           retrieve the Passive v2 thermal table (Srinivas Pandruvada)
      
         - Add DT bindings for SM6375, MSM8226 and QCM2290 Qcom platforms
           (Konrad Dybcio)
      
         - Add DT bindings and support for QCom MSM8226 (Matti Lehtimäki)
      
         - Add DT bindings for QCom ipq9574 (Praveenkumar I)
      
         - Convert bcm2835 DT bindings to the yaml schema (Stefan Wahren)
      
         - Allow selecting the bang-bang governor as default (Thierry Reding)
      
         - Refactor and prepare the code to set the scene for RCar Gen4
           (Wolfram Sang)
      
         - Clean up and fix the QCom tsens drivers. Add DT bindings and
           calibration for the MSM8909 platform (Stephan Gerhold)
      
         - Revert a patch introducing a wrong usage of devm_of_iomap() on the
           Mediatek platform (Ricardo Cañuelo)
      
         - Fix the clock vs reset ordering in order to conform to the
           documentation on the sun8i (Christophe JAILLET)
      
         - Prevent setting up undocumented registers, enable only the
           described sensors and add support for version 2.1 on the Qoriq
           sensor (Peng Fan)
      
         - Add DT bindings and support for the Armada AP807 (Alex Leibovich)
      
         - Update the mlx5 driver with the recent thermal changes (Daniel
           Lezcano)
      
         - Convert to platform remove callback returning void on STM32 (Uwe
           Kleine-König)
      
         - Add error information printing for devm_thermal_add_hwmon_sysfs()
           and remove the redundant error messages from the Sun8i, Amlogic,
           i.MX, TI, K3, Tegra, Qoriq, Mediatek and QCom drivers (Yangtao Li)
      
         - Register as hwmon sensor for the Generic ADC (Chen-Yu Tsai)
      
         - Use the dev_err_probe() function in the QCom tsens alarm driver
           (Luca Weiss)"
      
      * tag 'thermal-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (39 commits)
        thermal/drivers/qcom/temp-alarm: Use dev_err_probe
        thermal/drivers/generic-adc: Register thermal zones as hwmon sensors
        thermal/drivers/mediatek/lvts_thermal: Remove redundant msg in lvts_ctrl_start()
        thermal/drivers/qcom: Remove redundant msg at probe time
        thermal/drivers/ti-soc: Remove redundant msg in ti_thermal_expose_sensor()
        thermal/drivers/qoriq: Remove redundant msg in qoriq_tmu_register_tmu_zone()
        thermal/drivers/tegra: Remove redundant msg in tegra_tsensor_register_channel()
        drivers/thermal/k3: Remove redundant msg in k3_bandgap_probe()
        thermal/drivers/imx: Remove redundant msg in imx8mm_tmu_probe() and imx_sc_thermal_probe()
        thermal/drivers/amlogic: Remove redundant msg in amlogic_thermal_probe()
        thermal/drivers/sun8i: Remove redundant msg in sun8i_ths_register()
        thermal/hwmon: Add error information printing for devm_thermal_add_hwmon_sysfs()
        thermal/drivers/stm32: Convert to platform remove callback returning void
        net/mlx5: Update the driver with the recent thermal changes
        thermal/drivers/armada: Add support for AP807 thermal data
        dt-bindings: armada-thermal: Add armada-ap807-thermal compatible
        thermal/drivers/qoriq: Support version 2.1
        thermal/drivers/qoriq: Only enable supported sensors
        thermal/drivers/qoriq: No need to program site adjustment register
        thermal/drivers/mediatek/lvts_thermal: Register thermal zones as hwmon sensors
        ...
    • Merge tag 'pm-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 40e8e98f
      Linus Torvalds authored
      Pull power management updates from Rafael Wysocki:
       "These add Intel TPMI (Topology Aware Register and PM Capsule
        Interface) support to the power capping subsystem, extend the
        intel_idle driver to work in VM guests where MWAIT is not available,
        extend the system-wide power management diagnostics, fix bugs and
        clean up code.
      
        Specifics:
      
         - Introduce power capping core support for Intel TPMI (Topology Aware
           Register and PM Capsule Interface) and a TPMI interface driver for
           Intel RAPL (Zhang Rui, Dan Carpenter)
      
         - Fix CONFIG_IOSF_MBI dependency in the Intel RAPL power capping
           driver (Zhang Rui)
      
         - Fix invalid initialization for pl4_supported field in the Intel
           RAPL power capping driver (Sumeet Pawnikar)
      
         - Clean up the intel_idle driver, make it work with VM guests that
           cannot use the MWAIT instruction and address the case in which the
           host may enter a deep idle state when the guest is idle (Arjan van
           de Ven)
      
         - Prevent cpufreq drivers that provide the ->adjust_perf() callback
           without a ->fast_switch() one, which is used as a fallback from
           the former in some cases, from registering (Wyes Karny)
      
         - Fix some issues related to the AMD P-state cpufreq driver (Mario
           Limonciello, Wyes Karny)
      
         - Fix the energy_performance_preference attribute handling in the
           intel_pstate driver in passive mode (Tero Kristo)
      
         - Fix the handling of pm_suspend_target_state when CONFIG_PM is unset
           (Kai-Heng Feng)
      
         - Correct spelling mistake in a comment in the hibernation code (Wang
           Honghui)
      
         - Add arch_resume_nosmt() prototype to avoid a "missing prototypes"
           build warning (Arnd Bergmann)
      
         - Restrict pm_pr_dbg() to system-wide power transitions and use it in
           a few additional places (Mario Limonciello)
      
         - Drop verification of in-params from genpd_add_device() and ensure
           that all of its callers will do it (Ulf Hansson)
      
         - Prevent possible integer overflows from occurring in
           genpd_parse_state() (Nikita Zhandarovich)
      
         - Reorder fields in 'struct devfreq_dev_status' to reduce its size
           somewhat (Christophe JAILLET)
      
         - Ensure that the Exynos PPMU driver is already loaded before the
           Exynos Bus driver starts probing so as to avoid a possible freeze
           while loading the kernel modules (Marek Szyprowski)
      
         - Fix variable dereferencing before NULL check in the mtk-cci devfreq
           driver (Sukrut Bellary)"
      
      * tag 'pm-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (42 commits)
        intel_idle: Add a "Long HLT" C1 state for the VM guest mode
        cpufreq: intel_pstate: Fix energy_performance_preference for passive
        cpufreq: amd-pstate: Add a kernel config option to set default mode
        cpufreq: amd-pstate: Set a fallback policy based on preferred_profile
        ACPI: CPPC: Add definition for undefined FADT preferred PM profile value
        cpufreq: amd-pstate: Set default governor to schedutil
        PM: domains: Move the verification of in-params from genpd_add_device()
        cpufreq: amd-pstate: Make amd-pstate EPP driver name hyphenated
        cpufreq: amd-pstate: Write CPPC enable bit per-socket
        intel_idle: Add support for using intel_idle in a VM guest using just hlt
        cpufreq: Fail driver register if it has adjust_perf without fast_switch
        intel_idle: clean up the (new) state_update_enter_method function
        intel_idle: refactor state->enter manipulation into its own function
        platform/x86/amd: pmc: Use pm_pr_dbg() for suspend related messages
        pinctrl: amd: Use pm_pr_dbg to show debugging messages
        ACPI: x86: Add pm_debug_messages for LPS0 _DSM state tracking
        include/linux/suspend.h: Only show pm_pr_dbg messages at suspend/resume
        powercap: RAPL: Fix a NULL vs IS_ERR() bug
        powercap: RAPL: Fix CONFIG_IOSF_MBI dependency
        powercap: RAPL: fix invalid initialization for pl4_supported field
        ...
    • Merge tag 'acpi-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · bb695055
      Linus Torvalds authored
      Pull ACPI updates from Rafael Wysocki:
       "These rework the handling of notifications in ACPI button drivers (to
        enable future simplifications and cleanups), clean up the ACPI thermal
        driver, update the ACPI backlight driver, add quirks working around
        AML bugs on some systems, fix some assorted issues and clean up code.
      
        Specifics:
      
         - Reduce ACPI device enumeration overhead related to devices with
           dependencies (Rafael Wysocki)
      
         - Fix the handling of Microsoft LPS0 _DSM for suspend-to-idle (Mario
           Limonciello)
      
         - Fix section mismatch warning in the ACPI suspend-to-idle code (Arnd
           Bergmann)
      
         - Drop several ACPI resource management quirks related to IRQ
           overrides on AMD "Zen" systems (Mario Limonciello)
      
         - Modify the ACPI EC driver to make it only clear the EC GPE status
           when handling the GPE (Jeremy Compostella)
      
         - Add quirks to work around ACPI tables defects on Lenovo Yoga Book
           yb1-x90f/l and Nextbook Ares 8A (Hans de Goede)
      
         - Add ACPI backlight quirks for Dell Studio 1569, Lenovo ThinkPad
           X131e (3371 AMD version) and Apple iMac11,3 and stop trying to use
           vendor backlight control on relatively recent systems (Hans de
           Goede)
      
         - Add pwm_lookup_table entry for second PWM on CHT/BSW devices in the
           ACPI LPSS (Intel SoC) driver (Hans de Goede)
      
         - Add nfit_intel_shutdown_status() declaration to a local header to
           avoid a "missing prototypes" build warning (Arnd Bergmann)
      
         - Clean up the ACPI thermal driver and drop some dead or otherwise
           unneeded code from it (Rafael Wysocki)
      
         - Rework the handling of notifications in the ACPI button drivers so
           as to allow the common notification handling code for devices to be
           simplified (Rafael Wysocki)
      
         - Make ghes_get_devices() return NULL to indicate that there are no
           GHES devices so as to allow vendor-specific EDAC drivers to probe
           then (Li Yang)
      
         - Mark bert_disable() as __initdata and drop an unused function from
           the APEI GHES code (Miaohe Lin)
      
         - Make the ACPI PAD (Processor Aggregator Device) driver realize that
           Zhaoxin CPUs support nonstop TSC (Tony W Wang-oc)
      
         - Drop the certainly unnecessary and likely incorrect inclusion of
           linux/arm-smccc.h from acpi_ffh.c (Sudeep Holla)"
      
      * tag 'acpi-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (30 commits)
        ACPI: video: Add backlight=native DMI quirk for Dell Studio 1569
        ACPI: thermal: Drop struct acpi_thermal_flags
        ACPI: thermal: Drop struct acpi_thermal_state
        ACPI: bus: Simplify installation and removal of notify callback
        ACPI: tiny-power-button: Eliminate the driver notify callback
        ACPI: button: Use different notify handlers for lid and buttons
        ACPI: button: Eliminate the driver notify callback
        ACPI: thermal: Eliminate struct acpi_thermal_state_flags
        ACPI: thermal: Move acpi_thermal_driver definition
        ACPI: thermal: Move symbol definitions to one place
        ACPI: thermal: Drop redundant ACPI_TRIPS_REFRESH_DEVICES symbol
        ACPI: thermal: Use BIT() macro for defining flags
        APEI: GHES: correctly return NULL for ghes_get_devices()
        ACPI: FFH: Drop the inclusion of linux/arm-smccc.h
        ACPI: PAD: mark Zhaoxin CPUs NONSTOP TSC correctly
        ACPI: APEI: mark bert_disable as __initdata
        ACPI: EC: Clear GPE on interrupt handling only
        ACPI: video: Stop trying to use vendor backlight control on laptops from after ~2012
        ACPI: x86: s2idle: Adjust Microsoft LPS0 _DSM handling sequence
        ACPI: resource: Remove "Zen" specific match and quirks
        ...
    • Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 2605e80d
      Linus Torvalds authored
      Pull arm64 updates from Catalin Marinas:
       "Notable features are user-space support for the memcpy/memset
        instructions and the permission indirection extension.
      
         - Support for the Armv8.9 Permission Indirection Extensions. While
           this feature doesn't add new functionality, it enables future
           support for Guarded Control Stacks (GCS) and Permission Overlays
      
         - User-space support for the Armv8.8 memcpy/memset instructions
      
         - arm64 perf: support the HiSilicon SoC uncore PMU, Arm CMN sysfs
           identifier, support for the NXP i.MX9 SoC DDRC PMU, fixes and
           cleanups
      
         - Removal of superfluous ISBs on context switch (following
           retrospective architecture tightening)
      
         - Decode the ISS2 register during faults for additional information
           to help with debugging
      
         - KPTI clean-up/simplification of the trampoline exit code
      
         - Addressing several -Wmissing-prototype warnings
      
         - Kselftest improvements for signal handling and ptrace
      
         - Fix TPIDR2_EL0 restoring on sigreturn
      
         - Clean-up, robustness improvements of the module allocation code
      
         - More sysreg conversions to the automatic register/bitfields
           generation
      
         - CPU capabilities handling cleanup
      
         - Arm documentation updates: ACPI, ptdump"
      
      * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (124 commits)
        kselftest/arm64: Add a test case for TPIDR2 restore
        arm64/signal: Restore TPIDR2 register rather than memory state
        arm64: alternatives: make clean_dcache_range_nopatch() noinstr-safe
        Documentation/arm64: Add ptdump documentation
        arm64: hibernate: remove WARN_ON in save_processor_state
        kselftest/arm64: Log signal code and address for unexpected signals
        docs: perf: Fix warning from 'make htmldocs' in hisi-pmu.rst
        arm64/fpsimd: Exit streaming mode when flushing tasks
        docs: perf: Add new description for HiSilicon UC PMU
        drivers/perf: hisi: Add support for HiSilicon UC PMU driver
        drivers/perf: hisi: Add support for HiSilicon H60PA and PAv3 PMU driver
        perf: arm_cspmu: Add missing MODULE_DEVICE_TABLE
        perf/arm-cmn: Add sysfs identifier
        perf/arm-cmn: Revamp model detection
        perf/arm_dmc620: Add cpumask
        arm64: mm: fix VA-range sanity check
        arm64/mm: remove now-superfluous ISBs from TTBR writes
        Documentation/arm64: Update ACPI tables from BBR
        Documentation/arm64: Update references in arm-acpi
        Documentation/arm64: Update ARM and arch reference
        ...
    • Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm · 2b603cd5
      Linus Torvalds authored
      Pull ARM updates from Russell King:
      
       - lots of build cleanups from Arnd spread throughout the arch/arm tree
      
       - replace strlcpy() with the preferred strscpy()
      
       - use sign_extend32() in the module linker
      
       - drop handle_irq() machine descriptor method
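
As a rough illustration of the sign_extend32() item above, here is a user-space sketch of what such a helper computes: it treats bit `index` of a 32-bit value as the sign bit and propagates it into the upper bits. The name `sketch_sign_extend32` is hypothetical and this is not the kernel's definition. It relies on arithmetic right shift of negative signed values, which mainstream compilers provide but the C standard leaves implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sign-extend a 32-bit value whose sign bit sits at position `index`
 * (0-based, so index 23 means a 24-bit signed field).
 * Assumes arithmetic right shift for negative int32_t values.
 */
static int32_t sketch_sign_extend32(uint32_t value, int index)
{
    assert(index >= 0 && index < 32);

    int shift = 31 - index;

    /* Move the field's sign bit to bit 31, then shift back down. */
    return (int32_t)(value << shift) >> shift;
}
```

For a 24-bit field such as a relocation offset, `sketch_sign_extend32(0x00FFFFFF, 23)` yields -1 and `sketch_sign_extend32(0x007FFFFF, 23)` yields 8388607. Bits above the sign bit are discarded, which is what makes a single helper preferable to open-coded mask-and-or sequences.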
      
      * tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
        ARM: 9315/1: fiq: include asm/mach/irq.h for prototypes
        ARM: 9314/1: tcm: move tcm_init() prototype to asm/tcm.h
        ARM: 9313/1: vdso: add missing prototypes
        ARM: 9312/1: vfp: include asm/neon.h in vfpmodule.c
        ARM: 9311/1: decompressor: move function prototypes to misc.h
        ARM: 9310/1: xip-kernel: add __inflate_kernel_data prototype
        ARM: 9309/1: add missing syscall prototypes
        ARM: 9308/1: move setup functions to header
        ARM: 9307/1: nommu: include asm/idmap.h
        ARM: 9306/1: cacheflush: avoid __flush_anon_page() missing-prototype warning
        ARM: 9305/1: add clear/copy_user_highpage declarations
        ARM: 9304/1: add prototype for function called only from asm
        ARM: 9303/1: kprobes: avoid missing-declaration warnings
        ARM: 9302/1: traps: hide unused functions on NOMMU
        ARM: 9301/1: dma-mapping: hide unused dma_contiguous_early_fixup function
        ARM: 9300/1: Replace all non-returning strlcpy with strscpy
        ARM: 9299/1: module: use sign_extend32() to extend the signedness
        ARM: 9298/1: Drop custom mdesc->handle_irq()
    • Merge tag 'm68k-for-v6.5-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k · f810c182
      Linus Torvalds authored
      Pull m68k updates from Geert Uytterhoeven:
      
        - miscellaneous NuBus fixes and improvements
      
        - defconfig updates
      
      * tag 'm68k-for-v6.5-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
        m68k: defconfig: Update defconfigs for v6.4-rc1
        nubus: Don't list slot resources by default
        nubus: Remove proc entries before adding them
        nubus: Partially revert proc_create_single_data() conversion
    • Merge tag 'x86_cleanups_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 19300488
      Linus Torvalds authored
      Pull x86 cleanups from Dave Hansen:
       "As usual, these are all over the map. The biggest cluster is work from
        Arnd to eliminate -Wmissing-prototype warnings:
      
         - Address -Wmissing-prototype warnings
      
         - Remove repeated 'the' in comments
      
         - Remove unused current_untag_mask()
      
         - Document urgent tip branch timing
      
         - Clean up MSR kernel-doc notation
      
         - Clean up paravirt_ops doc
      
         - Update Srivatsa S. Bhat's maintained areas
      
         - Remove unused extern declaration acpi_copy_wakeup_routine()"
      
      * tag 'x86_cleanups_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
        x86/acpi: Remove unused extern declaration acpi_copy_wakeup_routine()
        Documentation: virt: Clean up paravirt_ops doc
        x86/mm: Remove unused current_untag_mask()
        x86/mm: Remove repeated word in comments
        x86/lib/msr: Clean up kernel-doc notation
        x86/platform: Avoid missing-prototype warnings for OLPC
        x86/mm: Add early_memremap_pgprot_adjust() prototype
        x86/usercopy: Include arch_wb_cache_pmem() declaration
        x86/vdso: Include vdso/processor.h
        x86/mce: Add copy_mc_fragile_handle_tail() prototype
        x86/fbdev: Include asm/fb.h as needed
        x86/hibernate: Declare global functions in suspend.h
        x86/entry: Add do_SYSENTER_32() prototype
        x86/quirks: Include linux/pnp.h for arch_pnpbios_disabled()
        x86/mm: Include asm/numa.h for set_highmem_pages_init()
        x86: Avoid missing-prototype warnings for doublefault code
        x86/fpu: Include asm/fpu/regset.h
        x86: Add dummy prototype for mk_early_pgtbl_32()
        x86/pci: Mark local functions as 'static'
        x86/ftrace: Move prepare_ftrace_return prototype to header
        ...
    • Merge tag 'x86_tdx_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5dfe7a7e
      Linus Torvalds authored
      Pull x86 tdx updates from Dave Hansen:
      
       - Fix a race window where load_unaligned_zeropad() could cause a fatal
         shutdown during TDX private<=>shared conversion
      
         The race has never been observed in practice but might allow
         load_unaligned_zeropad() to catch a TDX page in the middle of its
         conversion process which would lead to a fatal and unrecoverable
         guest shutdown.
      
       - Annotate sites where VM "exit reasons" are reused as hypercall
         numbers.
      
      * tag 'x86_tdx_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/mm: Fix enc_status_change_finish_noop()
        x86/tdx: Fix race between set_memory_encrypted() and load_unaligned_zeropad()
        x86/mm: Allow guest.enc_status_change_prepare() to fail
        x86/tdx: Wrap exit reason with hcall_func()
    • Merge tag 'x86_platform_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 36db3144
      Linus Torvalds authored
      Pull x86 platform updates from Dave Hansen:
       "Allow CPUs in SGI/HPE Ultraviolet to start using Sub-NUMA clustering
        (SNC) mode. SNC has been around outside the UV world for a while but
        evidently never worked on UV systems.
      
        SNC is rather notorious for breaking bad assumptions of a 1:1
        relationship between physical sockets and NUMA nodes. The UV code was
        rather prolific with these assumptions and took quite a bit of
        refactoring to remove them"
      
      * tag 'x86_platform_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/platform/uv: Update UV[23] platform code for SNC
        x86/platform/uv: Remove remaining BUG_ON() and BUG() calls
        x86/platform/uv: UV support for sub-NUMA clustering
        x86/platform/uv: Helper functions for allocating and freeing conversion tables
        x86/platform/uv: When searching for minimums, start at INT_MAX not 99999
        x86/platform/uv: Fix printed information in calc_mmioh_map
        x86/platform/uv: Introduce helper function uv_pnode_to_socket.
        x86/platform/uv: Add platform resolving #defines for misc GAM_MMIOH_REDIRECT*
    • Merge tag 'x86_irq_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · a3d763f0
      Linus Torvalds authored
      Pull x86 irq updates from Dave Hansen:
       "Add Hyper-V interrupts to /proc/stat"
      
      * tag 'x86_irq_for_6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/irq: Add hardcoded hypervisor interrupts to /proc/stat
    • Merge tag 'x86_cpu_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 941d77c7
      Linus Torvalds authored
      Pull x86 cpu updates from Borislav Petkov:
      
       - Compute the purposeful misalignment of zen_untrain_ret automatically
         and assert __x86_return_thunk's alignment so that future changes to
         the symbol macros do not accidentally break them.
      
       - Remove CONFIG_X86_FEATURE_NAMES Kconfig option as its existence is
         pointless
      
      * tag 'x86_cpu_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/retbleed: Add __x86_return_thunk alignment checks
        x86/cpu: Remove X86_FEATURE_NAMES
        x86/Kconfig: Make X86_FEATURE_NAMES non-configurable in prompt
    • Merge tag 'x86_cc_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 2c96136a
      Linus Torvalds authored
      Pull x86 confidential computing update from Borislav Petkov:
      
       - Add support for unaccepted memory as specified in the UEFI spec v2.9.
      
         The gist of it all is that Intel TDX and AMD SEV-SNP confidential
         computing guests define the notion of accepting memory before using
         it and thus preventing a whole set of attacks against such guests
         like memory replay and the like.
      
         There are a couple of strategies of how memory should be accepted -
         the current implementation does an on-demand way of accepting.
      
      * tag 'x86_cc_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        virt: sevguest: Add CONFIG_CRYPTO dependency
        x86/efi: Safely enable unaccepted memory in UEFI
        x86/sev: Add SNP-specific unaccepted memory support
        x86/sev: Use large PSC requests if applicable
        x86/sev: Allow for use of the early boot GHCB for PSC requests
        x86/sev: Put PSC struct on the stack in prep for unaccepted memory support
        x86/sev: Fix calculation of end address based on number of pages
        x86/tdx: Add unaccepted memory support
        x86/tdx: Refactor try_accept_one()
        x86/tdx: Make _tdx_hypercall() and __tdx_module_call() available in boot stub
        efi/unaccepted: Avoid load_unaligned_zeropad() stepping into unaccepted memory
        efi: Add unaccepted memory support
        x86/boot/compressed: Handle unaccepted memory
        efi/libstub: Implement support for unaccepted memory
        efi/x86: Get full memory map in allocate_e820()
        mm: Add support for unaccepted memory
    • Merge tag 'x86_cache_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 3e5822e0
      Linus Torvalds authored
      Pull x86 resource control updates from Borislav Petkov:
      
       - Implement a rename operation in resctrlfs to facilitate handling of
         application containers with dynamically changing task lists
      
       - When reading the tasks file, show only the pids of tasks in the
         current pid namespace, rather than also showing pids from the init
         namespace
      
       - Other fixes and improvements
      
      * tag 'x86_cache_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        Documentation/x86: Documentation for MON group move feature
        x86/resctrl: Implement rename op for mon groups
        x86/resctrl: Factor rdtgroup lock for multi-file ops
        x86/resctrl: Only show tasks' pid in current pid namespace
    • Merge tag 'x86_build_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 59035135
      Linus Torvalds authored
      Pull x86 build update from Borislav Petkov:
      
       - Remove relocation information from vmlinux as it is not needed by
         other tooling and thus a slimmer binary is generated.
      
         This is important for distros who have to distribute vmlinux blobs
         with their kernel packages too and that extraneous unnecessary data
         bloats them for no good reason
      
      * tag 'x86_build_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/build: Avoid relocation information in final vmlinux
    • Merge tag 'x86_alternatives_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 8c69e7af
      Linus Torvalds authored
      Pull x86 instruction alternatives updates from Borislav Petkov:
      
       - Up until now the Fast Short Rep Mov optimizations implied the
         presence of the ERMS CPUID flag. AMD decoupled them with a BIOS
         setting so decouple that dependency in the kernel code too
      
       - Teach the alternatives machinery to handle relocations
      
       - Make debug_alternative accept flags so that only the set of patching
         one is interested in is shown
      
       - Other fixes, cleanups and optimizations to the patching code
      
      * tag 'x86_alternatives_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/alternative: PAUSE is not a NOP
        x86/alternatives: Add cond_resched() to text_poke_bp_batch()
        x86/nospec: Shorten RESET_CALL_DEPTH
        x86/alternatives: Add longer 64-bit NOPs
        x86/alternatives: Fix section mismatch warnings
        x86/alternative: Optimize returns patching
        x86/alternative: Complicate optimize_nops() some more
        x86/alternative: Rewrite optimize_nops() some
        x86/lib/memmove: Decouple ERMS from FSRM
        x86/alternative: Support relocations in alternatives
        x86/alternative: Make debug-alternative selective
      8c69e7af
    • Merge tag 'ras_core_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · aa35a483
      Linus Torvalds authored
      Pull RAS updates from Borislav Petkov:
      
       - Add initial support for RAS hardware found on AMD server GPUs (MI200).
      
         Those GPUs and CPUs are connected together through the coherent
         fabric and the GPU memory controllers report errors through x86's MCA
         so EDAC needs to support them. The amd64_edac driver now supports HBM
         (High Bandwidth Memory) and thus such heterogeneous memory controller
         systems
      
       - Other small cleanups and improvements
      
      * tag 'ras_core_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        EDAC/amd64: Cache and use GPU node map
        EDAC/amd64: Add support for AMD heterogeneous Family 19h Model 30h-3Fh
        EDAC/amd64: Document heterogeneous system enumeration
        x86/MCE/AMD, EDAC/mce_amd: Decode UMC_V2 ECC errors
        x86/amd_nb: Re-sort and re-indent PCI defines
        x86/amd_nb: Add MI200 PCI IDs
        ras/debugfs: Fix error checking for debugfs_create_dir()
        x86/MCE: Check a hw error's address to determine proper recovery action
      aa35a483
    • Merge tag 'edac_updates_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras · e5ce2f19
      Linus Torvalds authored
      Pull EDAC updates from Borislav Petkov:
      
       - amd64_edac: Add support for Zen4 client hardware
      
       - amd64_edac: Remove the version string as it is useless and actively
         confusing when looking at backported versions of the driver
      
       - Add a driver for the Nuvoton NPCM memory controller
      
       - A debugfs error checking cleanup
      
      * tag 'edac_updates_for_v6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
        EDAC/npcm: Add NPCM memory controller driver
        dt-bindings: memory-controllers: nuvoton: Add NPCM memory controller
        EDAC/thunderx: Check debugfs file creation retval properly
        EDAC/amd64: Add support for ECC on family 19h model 60h-7Fh
        EDAC/amd64: Remove module version string
      e5ce2f19
    • Merge tag 'x86-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip · 88afbb21
      Linus Torvalds authored
      Pull x86 core updates from Thomas Gleixner:
       "A set of fixes for kexec(), reboot and shutdown issues:
      
         - Ensure that the WBINVD in stop_this_cpu() has been completed before
           the control CPU proceeds.
      
           stop_this_cpu() is used for kexec(), reboot and shutdown to park
           the APs in a HLT loop.
      
           The control CPU sends an IPI to the APs and waits for their CPU
           online bits to be cleared. Once they all are marked "offline" it
           proceeds.
      
           But stop_this_cpu() clears the CPU online bit before issuing
           WBINVD, which means there is no guarantee that the AP has reached
           the HLT loop.
      
           This was reported to cause intermittent reboot/shutdown failures
           due to some dubious interaction with the firmware.
      
           This is not only a problem of WBINVD. The code to actually "stop"
           the CPU which runs between clearing the online bit and reaching the
           HLT loop can cause large enough delays on its own (think
           virtualization). That's especially dangerous for kexec() as kexec()
           expects that all APs are in a safe state and not executing code
           while the boot CPU jumps to the new kernel. There are more issues
           vs kexec() which are addressed separately.
      
           Cure this by implementing an explicit synchronization point right
           before the AP reaches HLT. This guarantees that the AP has
           completed the full stop procedure.
      
         - Fix the condition for WBINVD in stop_this_cpu().
      
           The WBINVD in stop_this_cpu() is required for ensuring that when
           switching to or from memory encryption no dirty data is left in the
           cache lines which might cause a write back in the wrong mode later.
      
           This checks CPUID directly because the feature bit might have been
           cleared due to a command line option.
      
           But that CPUID check accesses leaf 0x8000001f::EAX unconditionally.
           Intel CPUs return the content of the highest supported leaf when a
           non-existing leaf is read, while AMD CPUs return all zeros for
           unsupported leafs.
      
           So the result of the test on Intel CPUs is a lottery and on AMD it's
           just correct by chance.
      
           While harmless it's incorrect and causes the conditional wbinvd()
           to be issued where not required, which caused the above issue to be
           unearthed.
      
         - Make kexec() robust against AP code execution
      
           Ashok observed triple faults when doing kexec() on a system which
           had been booted with "nosmt".
      
           It turned out that the SMT siblings which had been brought up
           partially are parked in mwait_play_dead() to enable power savings.
      
           mwait_play_dead() is monitoring the thread flags of the AP's idle
           task, which has been chosen as it's unlikely to be written to.
      
           But kexec() can overwrite the previous kernel text and data
           including page tables etc. When it overwrites the cache lines
           monitored by an AP that AP resumes execution after the MWAIT on
           possibly overwritten text, stack and page tables, which obviously
           might end up in a triple fault easily.
      
           Make this more robust in several steps:
      
            1) Use an explicit per CPU cache line for monitoring.
      
            2) Write a command to these cache lines to kick APs out of MWAIT
               before proceeding with kexec(), shutdown or reboot.
      
               The APs confirm the wakeup by writing status back and then
               enter a HLT loop.
      
            3) If the system uses INIT/INIT/STARTUP for AP bringup, park the
               APs in INIT state.
      
               HLT is not a guarantee that an AP won't wake up and resume
               execution. HLT is woken up by NMI and SMI. SMI puts the CPU
               back into HLT (+/- firmware bugs), but NMI is delivered to the
               CPU which executes the NMI handler. Same issue as the MWAIT
               scenario described above.
      
               Sending an INIT/INIT sequence to the APs puts them into wait
               for STARTUP state, which is safe against NMI.
      
           There is still an issue remaining which can't be fixed: #MCE
      
           If the AP sits in HLT and receives a broadcast #MCE it will try to
           handle it with the obvious consequences.
      
           INIT/INIT clears CR4.MCE in the AP which will cause a broadcast
           #MCE to shut down the machine.
      
           So there is a choice between fire (HLT) and frying pan (INIT).
           Frying pan has been chosen as it's at least preventing the NMI
           issue.
      
           On systems which are not using INIT/INIT/STARTUP there is not much
           which can be done right now, but at least the obvious and easy to
           trigger MWAIT issue has been addressed"
      
      * tag 'x86-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/smp: Put CPUs into INIT on shutdown if possible
        x86/smp: Split sending INIT IPI out into a helper function
        x86/smp: Cure kexec() vs. mwait_play_dead() breakage
        x86/smp: Use dedicated cache-line for mwait_play_dead()
        x86/smp: Remove pointless wmb()s from native_stop_other_cpus()
        x86/smp: Dont access non-existing CPUID leaf
        x86/smp: Make stop_other_cpus() more robust
      88afbb21
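      The CPUID fix in the pull above (do not read leaf 0x8000001f unless the
      CPU reports that it exists) can be illustrated with a toy model. The
      sketch below is plain Python, not kernel code: the `cpu` dictionaries,
      their field names and all register values are invented for illustration,
      the out-of-range behaviour is modelled only as the pull message
      describes it, and bit 0 of the leaf's EAX is assumed to be the memory
      encryption (SME) flag.

```python
SME_LEAF = 0x8000001F  # extended leaf that reports memory encryption features
SME_BIT = 1 << 0       # assumption: EAX bit 0 indicates SME support


def cpuid_eax(cpu, leaf):
    """Return EAX for `leaf` on a modelled CPU.

    Out-of-range reads follow the behaviour described in the pull message:
    "intel" returns the data of its highest supported leaf, "amd" returns 0.
    """
    if leaf in cpu["leaves"]:
        return cpu["leaves"][leaf]
    if cpu["vendor"] == "intel":
        return cpu["leaves"][cpu["fallback_leaf"]]
    return 0


def wbinvd_needed_unguarded(cpu):
    """Old logic: read the leaf unconditionally."""
    return bool(cpuid_eax(cpu, SME_LEAF) & SME_BIT)


def wbinvd_needed_guarded(cpu):
    """Fixed logic: only trust the leaf if the CPU reports that it exists."""
    if cpu["max_ext_leaf"] < SME_LEAF:
        return False
    return bool(cpuid_eax(cpu, SME_LEAF) & SME_BIT)


# Hypothetical CPUs; all register values are made up.
intel_like = {
    "vendor": "intel",
    "max_ext_leaf": 0x80000008,
    "fallback_leaf": 0x16,
    "leaves": {0x16: 0x00000BB9},  # odd value, so bit 0 happens to be set
}
amd_with_sme = {
    "vendor": "amd",
    "max_ext_leaf": 0x8000001F,
    "leaves": {SME_LEAF: 0x00000003},
}

for name, cpu in (("intel_like", intel_like), ("amd_with_sme", amd_with_sme)):
    print(name, wbinvd_needed_unguarded(cpu), wbinvd_needed_guarded(cpu))
```

      On the modelled Intel-like CPU the unguarded check depends on whatever
      happens to sit in the fallback leaf, while the guarded check does not.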
    • Merge tag 'timers-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip · cd336f65
      Linus Torvalds authored
      Pull timer updates from Thomas Gleixner:
       "Time, timekeeping and related device driver updates:
      
        Core:
      
         - A set of fixes, cleanups and enhancements to the posix timer code:
      
             - Prevent another possible live lock scenario in the exit() path,
               which affects POSIX_CPU_TIMERS_TASK_WORK enabled architectures.
      
             - Fix a loop termination issue which was reported by syzkaller/KSAN
               in the posix timer ID allocation code.
      
               That triggered a deeper look into the posix-timer code which
               unearthed more small issues.
      
             - Add missing READ/WRITE_ONCE() annotations
      
             - Fix or remove completely outdated comments
      
             - Document places which are subtle and completely undocumented.
      
         - Add missing hrtimer modes to the trace event decoder
      
         - Small cleanups and enhancements all over the place
      
        Drivers:
      
         - Rework the Hyper-V clocksource and sched clock setup code
      
         - Remove a deprecated clocksource driver
      
         - Small fixes and enhancements all over the place"
      
      * tag 'timers-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
        clocksource/drivers/cadence-ttc: Fix memory leak in ttc_timer_probe
        dt-bindings: timers: Add Ralink SoCs timer
        clocksource/drivers/hyper-v: Rework clocksource and sched clock setup
        dt-bindings: timer: brcm,kona-timer: convert to YAML
        clocksource/drivers/imx-gpt: Fold <soc/imx/timer.h> into its only user
        clk: imx: Drop inclusion of unused header <soc/imx/timer.h>
        hrtimer: Add missing sparse annotations to hrtimer locking
        clocksource/drivers/imx-gpt: Use only a single name for functions
        clocksource/drivers/loongson1: Move PWM timer to clocksource framework
        dt-bindings: timer: Add Loongson-1 clocksource
        MIPS: Loongson32: Remove deprecated PWM timer clocksource
        clocksource/drivers/ingenic-timer: Use pm_sleep_ptr() macro
        tracing/timer: Add missing hrtimer modes to decode_hrtimer_mode().
        posix-timers: Add sys_ni_posix_timers() prototype
        tick/rcu: Fix bogus ratelimit condition
        alarmtimer: Remove unnecessary (void *) cast
        alarmtimer: Remove unnecessary initialization of variable 'ret'
        posix-timers: Refer properly to CONFIG_HIGH_RES_TIMERS
        posix-timers: Polish coding style in a few places
        posix-timers: Remove pointless comments
        ...
      cd336f65
    • Merge tag 'smp-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip · 9244724f
      Linus Torvalds authored
      Pull SMP updates from Thomas Gleixner:
       "A large update for SMP management:
      
         - Parallel CPU bringup
      
           The reason why people are interested in parallel bringup is to
           shorten the (kexec) reboot time of cloud servers to reduce the
           downtime of the VM tenants.
      
           The current fully serialized bringup does the following per AP:
      
             1) Prepare callbacks (allocate, initialize, create threads)
             2) Kick the AP alive (e.g. INIT/SIPI on x86)
             3) Wait for the AP to report alive state
             4) Let the AP continue through the atomic bringup
             5) Let the AP run the threaded bringup to full online state
      
           There are two significant delays:
      
             #3 The time for an AP to report alive state in start_secondary()
                on x86 has been measured in the range between 350us and 3.5ms
                depending on vendor and CPU type, BIOS microcode size etc.
      
             #4 The atomic bringup does the microcode update. This has been
                measured to take up to ~8ms on the primary threads depending
                on the microcode patch size to apply.
      
           On a two socket SKL server with 56 cores (112 threads) the boot CPU
           spends on current mainline about 800ms busy waiting for the APs to
           come up and apply microcode. That's more than 80% of the actual
           onlining procedure.
      
           This can be reduced significantly by splitting the bringup
           mechanism into two parts:
      
             1) Run the prepare callbacks and kick the AP alive for each AP
                which needs to be brought up.
      
                The APs wake up, do their firmware initialization and run the
                low level kernel startup code including microcode loading in
                parallel up to the first synchronization point. (#1 and #2
                above)
      
             2) Run the rest of the bringup code strictly serialized per CPU
                (#3 - #5 above) as it's done today.
      
                Parallelizing that stage of the CPU bringup might be possible
                in theory, but it's questionable whether required surgery
                would be justified for a pretty small gain.
      
           If the system is large enough the first AP is already waiting at
           the first synchronization point when the boot CPU finished the
           wake-up of the last AP. That reduces the AP bringup time on that
           SKL from ~800ms to ~80ms, i.e. by a factor ~10x.
      
           The actual gain varies wildly depending on the system, CPU,
           microcode patch size and other factors. There are some
           opportunities to reduce the overhead further, but that needs some
           deep surgery in the x86 CPU bringup code.
      
           For now this is only enabled on x86, but the core functionality
           obviously works for all SMP capable architectures.
      
         - Enhancements for SMP function call tracing so it is possible to
           locate the scheduling and the actual execution points. That allows
           measuring IPI delivery time precisely"
      
      * tag 'smp-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
        trace,smp: Add tracepoints for scheduling remotelly called functions
        trace,smp: Add tracepoints around remotelly called functions
        MAINTAINERS: Add CPU HOTPLUG entry
        x86/smpboot: Fix the parallel bringup decision
        x86/realmode: Make stack lock work in trampoline_compat()
        x86/smp: Initialize cpu_primary_thread_mask late
        cpu/hotplug: Fix off by one in cpuhp_bringup_mask()
        x86/apic: Fix use of X{,2}APIC_ENABLE in asm with older binutils
        x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it
        x86/smpboot: Support parallel startup of secondary CPUs
        x86/smpboot: Implement a bit spinlock to protect the realmode stack
        x86/apic: Save the APIC virtual base address
        cpu/hotplug: Allow "parallel" bringup up to CPUHP_BP_KICK_AP_STATE
        x86/apic: Provide cpu_primary_thread mask
        x86/smpboot: Enable split CPU startup
        cpu/hotplug: Provide a split up CPUHP_BRINGUP mechanism
        cpu/hotplug: Reset task stack state in _cpu_up()
        cpu/hotplug: Remove unused state functions
        riscv: Switch to hotplug core state synchronization
        parisc: Switch to hotplug core state synchronization
        ...
      9244724f
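      The effect of splitting the bringup described in the pull above can be
      sketched with a small timing model. This is a toy calculation in Python,
      not the hotplug code: the three per-AP cost parameters and their values
      are assumptions chosen only to land in the rough range the pull message
      quotes, and real systems vary widely.

```python
def serial_bringup_us(n_aps, kick_us, startup_us, finish_us):
    """Fully serialized: the boot CPU pays every phase for every AP."""
    return n_aps * (kick_us + startup_us + finish_us)


def parallel_bringup_us(n_aps, kick_us, startup_us, finish_us):
    """Two-part bringup: kick all APs first, then finish them one by one."""
    # Part 1: AP i is kicked at (i + 1) * kick_us and then starts up on its own.
    ready_at = [(i + 1) * kick_us + startup_us for i in range(n_aps)]
    now = n_aps * kick_us  # the boot CPU has just kicked the last AP
    # Part 2: serialized completion; wait only if an AP is not ready yet.
    for ready in ready_at:
        now = max(now, ready) + finish_us
    return now


# Made-up per-AP costs in microseconds for a 112-thread machine (111 APs).
N_APS, KICK, STARTUP, FINISH = 111, 100, 6400, 600
serial = serial_bringup_us(N_APS, KICK, STARTUP, FINISH)
parallel = parallel_bringup_us(N_APS, KICK, STARTUP, FINISH)
print(f"serial:   {serial / 1000:.1f} ms")
print(f"parallel: {parallel / 1000:.1f} ms")
```

      With these invented inputs the model gives about 788 ms serialized and
      about 78 ms split, because by the time the last AP has been kicked the
      first ones are already waiting at the synchronization point. With a
      single AP both variants cost the same.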
    • Merge tag 'x86-boot-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7cffdbe3
      Linus Torvalds authored
      Pull x86 boot updates from Thomas Gleixner:
       "Initialize FPU late.
      
        Right now FPU is initialized very early during boot. There is no real
        requirement to do so. The only requirement is to have it done before
        alternatives are patched.
      
        That's done in check_bugs() which does way more than what the function
        name suggests.
      
        So first rename check_bugs() to arch_cpu_finalize_init() which makes
        it clear what this is about.
      
        Move the invocation of arch_cpu_finalize_init() earlier in
        start_kernel() as it has to be done before fork_init() which needs to
        know the FPU register buffer size.
      
        With those prerequisites the FPU initialization can be moved into
        arch_cpu_finalize_init(), which removes it from the early and fragile
        part of the x86 bringup"
      
      * tag 'x86-boot-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/mem_encrypt: Unbreak the AMD_MEM_ENCRYPT=n build
        x86/fpu: Move FPU initialization into arch_cpu_finalize_init()
        x86/fpu: Mark init functions __init
        x86/fpu: Remove cpuinfo argument from init functions
        x86/init: Initialize signal frame size late
        init, x86: Move mem_encrypt_init() into arch_cpu_finalize_init()
        init: Invoke arch_cpu_finalize_init() earlier
        init: Remove check_bugs() leftovers
        um/cpu: Switch to arch_cpu_finalize_init()
        sparc/cpu: Switch to arch_cpu_finalize_init()
        sh/cpu: Switch to arch_cpu_finalize_init()
        mips/cpu: Switch to arch_cpu_finalize_init()
        m68k/cpu: Switch to arch_cpu_finalize_init()
        loongarch/cpu: Switch to arch_cpu_finalize_init()
        ia64/cpu: Switch to arch_cpu_finalize_init()
        ARM: cpu: Switch to arch_cpu_finalize_init()
        x86/cpu: Switch to arch_cpu_finalize_init()
        init: Provide arch_cpu_finalize_init()
      7cffdbe3
    • Merge tag 'irq-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip · 00173879
      Linus Torvalds authored
      Pull irq updates from Thomas Gleixner:
       "Updates for the interrupt subsystem:
      
        Core:
      
         - Convert the interrupt descriptor storage to a maple tree to
           overcome the limitations of the radixtree + fixed size bitmap.
      
           This allows us to handle very large servers with a huge number of
           guests without imposing a huge memory overhead on everyone
      
         - Implement optional retriggering of interrupts which utilize the
           fasteoi handler to work around a GICv3 architecture issue
      
        Drivers:
      
         - A set of fixes and updates for the Loongson/Loongarch related
           drivers
      
         - Workaround for an ASR8601 integration hiccup which ends up with CPU
           numbering which can't be represented in the GIC implementation
      
         - The usual set of boring fixes and updates all over the place"
      
      * tag 'irq-core-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
        Revert "irqchip/mxs: Include linux/irqchip/mxs.h"
        irqchip/jcore-aic: Fix missing allocation of IRQ descriptors
        irqchip/stm32-exti: Fix warning on initialized field overwritten
        irqchip/stm32-exti: Add STM32MP15xx IWDG2 EXTI to GIC map
        irqchip/gicv3: Add a iort_pmsi_get_dev_id() prototype
        irqchip/mxs: Include linux/irqchip/mxs.h
        irqchip/clps711x: Remove unused clps711x_intc_init() function
        irqchip/mmp: Remove non-DT codepath
        irqchip/ftintc010: Mark all function static
        irqdomain: Include internals.h for function prototypes
        irqchip/loongson-eiointc: Add DT init support
        dt-bindings: interrupt-controller: Add Loongson EIOINTC
        irqchip/loongson-eiointc: Fix irq affinity setting during resume
        irqchip/loongson-liointc: Add IRQCHIP_SKIP_SET_WAKE flag
        irqchip/loongson-liointc: Fix IRQ trigger polarity
        irqchip/loongson-pch-pic: Fix potential incorrect hwirq assignment
        irqchip/loongson-pch-pic: Fix initialization of HT vector register
        irqchip/gic-v3-its: Enable RESEND_WHEN_IN_PROGRESS for LPIs
        genirq: Allow fasteoi handler to resend interrupts on concurrent handling
        genirq: Expand doc for PENDING and REPLAY flags
        ...
      00173879
    • Merge tag 'core-debugobjects-2023-06-26' of... · cef2dd76
      Linus Torvalds authored
      Merge tag 'core-debugobjects-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip
      
      Pull debugobjects update from Thomas Gleixner:
       "A single update for debug objects:
      
         - Recheck whether debug objects is enabled before reporting a problem
           to avoid spamming the logs with messages which are caused by a
           concurrent OOM"
      
      * tag 'core-debugobjects-2023-06-26' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        debugobjects: Recheck debug_objects_enabled before reporting
      cef2dd76
    • Merge tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux · a0433f8c
      Linus Torvalds authored
      Pull block updates from Jens Axboe:
      
       - NVMe pull request via Keith:
            - Various cleanups all around (Irvin, Chaitanya, Christophe)
            - Better struct packing (Christophe JAILLET)
            - Reduce controller error logs for optional commands (Keith)
            - Support for >=64KiB block sizes (Daniel Gomez)
            - Fabrics fixes and code organization (Max, Chaitanya, Daniel
              Wagner)
      
       - bcache updates via Coly:
            - Fix a race at init time (Mingzhe Zou)
            - Misc fixes and cleanups (Andrea, Thomas, Zheng, Ye)
      
       - use page pinning in the block layer for dio (David)
      
       - convert old block dio code to page pinning (David, Christoph)
      
       - cleanups for pktcdvd (Andy)
      
       - cleanups for rnbd (Guoqing)
      
       - use the unchecked __bio_add_page() for the initial single page
         additions (Johannes)
      
       - fix overflows in the Amiga partition handling code (Michael)
      
       - improve mq-deadline zoned device support (Bart)
      
       - keep passthrough requests out of the IO schedulers (Christoph, Ming)
      
       - improve support for flush requests, making them less special to deal
         with (Christoph)
      
       - add bdev holder ops and shutdown methods (Christoph)
      
       - fix the name_to_dev_t() situation and use cases (Christoph)
      
       - decouple the block open flags from fmode_t (Christoph)
      
       - ublk updates and cleanups, including adding user copy support (Ming)
      
       - BFQ sanity checking (Bart)
      
       - convert brd from radix to xarray (Pankaj)
      
       - constify various structures (Thomas, Ivan)
      
       - more fine grained persistent reservation ioctl capability checks
         (Jingbo)
      
       - misc fixes and cleanups (Arnd, Azeem, Demi, Ed, Hengqi, Hou, Jan,
         Jordy, Li, Min, Yu, Zhong, Waiman)
      
      * tag 'for-6.5/block-2023-06-23' of git://git.kernel.dk/linux: (266 commits)
        scsi/sg: don't grab scsi host module reference
        ext4: Fix warning in blkdev_put()
        block: don't return -EINVAL for not found names in devt_from_devname
        cdrom: Fix spectre-v1 gadget
        block: Improve kernel-doc headers
        blk-mq: don't insert passthrough request into sw queue
        bsg: make bsg_class a static const structure
        ublk: make ublk_chr_class a static const structure
        aoe: make aoe_class a static const structure
        block/rnbd: make all 'class' structures const
        block: fix the exclusive open mask in disk_scan_partitions
        block: add overflow checks for Amiga partition support
        block: change all __u32 annotations to __be32 in affs_hardblocks.h
        block: fix signed int overflow in Amiga partition support
        block: add capacity validation in bdev_add_partition()
        block: fine-granular CAP_SYS_ADMIN for Persistent Reservation
        block: disallow Persistent Reservation on partitions
        reiserfs: fix blkdev_put() warning from release_journal_dev()
        block: fix wrong mode for blkdev_get_by_dev() from disk_scan_partitions()
        block: document the holder argument to blkdev_get_by_path
        ...
      a0433f8c
    • Merge tag 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux · 0aa69d53
      Linus Torvalds authored
      Pull io_uring updates from Jens Axboe:
       "Nothing major in this release, just a bunch of cleanups and some
        optimizations around networking mostly.
      
         - clean up file request flags handling (Christoph)
      
         - clean up request freeing and CQ locking (Pavel)
      
         - support for pre-registering the io_uring fd at setup time
           (Josh)
      
         - Add support for user allocated ring memory, rather than having the
           kernel allocate it. Mostly for packing rings into a huge page (me)
      
         - avoid an unnecessary double retry on receive (me)
      
         - maintain ordering for task_work, which also improves performance
           (me)
      
         - misc cleanups/fixes (Pavel, me)"
      
      * tag 'for-6.5/io_uring-2023-06-23' of git://git.kernel.dk/linux: (39 commits)
        io_uring: merge conditional unlock flush helpers
        io_uring: make io_cq_unlock_post static
        io_uring: inline __io_cq_unlock
        io_uring: fix acquire/release annotations
        io_uring: kill io_cq_unlock()
        io_uring: remove IOU_F_TWQ_FORCE_NORMAL
        io_uring: don't batch task put on reqs free
        io_uring: move io_clean_op()
        io_uring: inline io_dismantle_req()
        io_uring: remove io_free_req_tw
        io_uring: open code io_put_req_find_next
        io_uring: add helpers to decode the fixed file file_ptr
        io_uring: use io_file_from_index in io_msg_grab_file
        io_uring: use io_file_from_index in __io_sync_cancel
        io_uring: return REQ_F_ flags from io_file_get_flags
        io_uring: remove io_req_ffs_set
        io_uring: remove a confusing comment above io_file_get_flags
        io_uring: remove the mode variable in io_file_get_flags
        io_uring: remove __io_file_supports_nowait
        io_uring: wait interruptibly for request completions on exit
        ...
      0aa69d53
    • Merge tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux · 3eccc0c8
      Linus Torvalds authored
      Pull splice updates from Jens Axboe:
       "This kills off ITER_PIPE to avoid a race between truncate,
        iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio
        with unpinned/unref'ed pages from an O_DIRECT splice read. This causes
        memory corruption.
      
        Instead, we either use (a) filemap_splice_read(), which invokes the
        buffered file reading code and splices from the pagecache into the
        pipe; (b) copy_splice_read(), which bulk-allocates a buffer, reads
        into it and then pushes the filled pages into the pipe; or (c) handle
        it in filesystem-specific code.
      
        Summary:
      
         - Rename direct_splice_read() to copy_splice_read()
      
         - Simplify the calculations for the number of pages to be reclaimed
           in copy_splice_read()
      
         - Turn do_splice_to() into a helper, vfs_splice_read(), so that it
           can be used by overlayfs and coda to perform the checks on the
           lower fs
      
         - Make vfs_splice_read() jump to copy_splice_read() to handle
           direct-I/O and DAX
      
         - Provide shmem with its own splice_read to handle non-existent pages
           in the pagecache. We don't want a ->read_folio() as we don't want
           to populate holes, but filemap_get_pages() requires it
      
         - Provide overlayfs with its own splice_read to call down to a lower
           layer as overlayfs doesn't provide ->read_folio()
      
         - Provide coda with its own splice_read to call down to a lower layer
           as coda doesn't provide ->read_folio()
      
         - Direct ->splice_read to copy_splice_read() in tty, procfs, kernfs
           and random files as they just copy to the output buffer and don't
           splice pages
      
         - Provide wrappers for afs, ceph, ecryptfs, ext4, f2fs, nfs, ntfs3,
           ocfs2, orangefs, xfs and zonefs to do locking and/or revalidation
      
         - Make cifs use filemap_splice_read()
      
         - Replace pointers to generic_file_splice_read() with pointers to
           filemap_splice_read() as DIO and DAX are handled in the caller;
           filesystems can still provide their own alternate ->splice_read()
           op
      
         - Remove generic_file_splice_read()
      
         - Remove ITER_PIPE and its paraphernalia as generic_file_splice_read
           was the only user"
      
      * tag 'for-6.5/splice-2023-06-23' of git://git.kernel.dk/linux: (31 commits)
        splice: kdoc for filemap_splice_read() and copy_splice_read()
        iov_iter: Kill ITER_PIPE
        splice: Remove generic_file_splice_read()
        splice: Use filemap_splice_read() instead of generic_file_splice_read()
        cifs: Use filemap_splice_read()
        trace: Convert trace/seq to use copy_splice_read()
        zonefs: Provide a splice-read wrapper
        xfs: Provide a splice-read wrapper
        orangefs: Provide a splice-read wrapper
        ocfs2: Provide a splice-read wrapper
        ntfs3: Provide a splice-read wrapper
        nfs: Provide a splice-read wrapper
        f2fs: Provide a splice-read wrapper
        ext4: Provide a splice-read wrapper
        ecryptfs: Provide a splice-read wrapper
        ceph: Provide a splice-read wrapper
        afs: Provide a splice-read wrapper
        9p: Add splice_read wrapper
        net: Make sock_splice_read() use copy_splice_read() by default
        tty, proc, kernfs, random: Use copy_splice_read()
        ...
      3eccc0c8
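      The read-side dispatch summarized in the splice pull above (direct I/O
      and DAX go through copy_splice_read(), everything else through the
      file's own ->splice_read) can be sketched as a toy model. This is
      Python, not the VFS code: the `File` class, its fields, the string
      return values and the error raised for a missing handler are
      assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


def copy_splice_read(f):
    """Stand-in: read into a freshly allocated buffer, then fill the pipe."""
    return "copy_splice_read"


def filemap_splice_read(f):
    """Stand-in: splice pages from the page cache into the pipe."""
    return "filemap_splice_read"


@dataclass
class File:
    """Hypothetical file model; field names are not kernel names."""
    o_direct: bool = False
    is_dax: bool = False
    splice_read: Optional[Callable] = filemap_splice_read


def vfs_splice_read(f):
    """Toy dispatch: DIO and DAX are handled in the caller."""
    if f.splice_read is None:
        # Assumption: a file without a handler cannot be spliced from.
        raise ValueError("file does not support splice reads")
    if f.o_direct or f.is_dax:
        return copy_splice_read(f)
    return f.splice_read(f)


print(vfs_splice_read(File()))
print(vfs_splice_read(File(o_direct=True)))
print(vfs_splice_read(File(splice_read=lambda f: "fs_specific_wrapper")))
```

      Because the caller handles direct I/O and DAX, a filesystem can point
      ->splice_read at filemap_splice_read() or at its own wrapper without
      special-casing those two modes itself.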
    • Merge tag 'for-6.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · cc423f63
      Linus Torvalds authored
      Pull btrfs updates from David Sterba:
       "Mainly core changes, refactoring and optimizations.
      
        Performance is improved in some areas, overall there may be a
        cumulative improvement due to refactoring that removed lookups in the
        IO path or simplified IO submission tracking.
      
        Core:
      
         - submit IO synchronously for fast checksums (crc32c and xxhash),
           remove high priority worker kthread
      
         - read extent buffer in one go, simplify IO tracking, bio submission
           and locking
      
         - remove additional tracking of redirtied extent buffers, originally
           added for zoned mode but actually not needed
      
         - track ordered extent pointer in bio to avoid rbtree lookups during
           IO
      
         - scrub, use recovered data stripes as cache to avoid unnecessary
           read
      
         - in zoned mode, optimize logical to physical mappings of extents
      
         - remove PageError handling, not set by VFS nor writeback
      
         - cleanups, refactoring, better structure packing
      
         - lots of error handling improvements
      
         - more assertions, lockdep annotations
      
         - print assertion failure with the exact line where it happens
      
         - tracepoint updates
      
         - more debugging prints
      
        Performance:
      
         - speedup in fsync(), better tracking of inode logged status can
           avoid transaction commit
      
         - IO path structures track logical offsets in data structures and
           do not need to look them up
      
        User visible changes:
      
         - don't commit transaction for every created subvolume, this can
           reduce time when many subvolumes are created in a batch
      
         - print affected files when relocation fails
      
         - trigger orphan file cleanup during START_SYNC ioctl
      
        Notable fixes:
      
         - fix crash when disabling quota and relocation
      
         - fix crashes when removing roots from dirty list
      
         - fix transaction abort during relocation when converting from newer
           profiles not covered by fallback
      
         - in zoned mode, stop reclaiming block groups if filesystem becomes
           read-only
      
         - fix rare race condition in tree mod log rewind that can miss some
           btree node slots
      
         - with enabled fsverity, drop up-to-date page bit in case the
           verification fails"
      
      * tag 'for-6.5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (194 commits)
        btrfs: fix race between quota disable and relocation
        btrfs: add comment to struct btrfs_fs_info::dirty_cowonly_roots
        btrfs: fix race when deleting free space root from the dirty cow roots list
        btrfs: fix race when deleting quota root from the dirty cow roots list
        btrfs: tracepoints: also show actual number of the outstanding extents
        btrfs: update i_version in update_dev_time
        btrfs: make btrfs_compressed_bioset static
        btrfs: add handling for RAID1C23/DUP to btrfs_reduce_alloc_profile
        btrfs: scrub: remove btrfs_fs_info::scrub_wr_completion_workers
        btrfs: scrub: remove scrub_ctx::csum_list member
        btrfs: do not BUG_ON after failure to migrate space during truncation
        btrfs: do not BUG_ON on failure to get dir index for new snapshot
        btrfs: send: do not BUG_ON() on unexpected symlink data extent
        btrfs: do not BUG_ON() when dropping inode items from log root
        btrfs: replace BUG_ON() at split_item() with proper error handling
        btrfs: do not BUG_ON() on tree mod log failures at btrfs_del_ptr()
        btrfs: do not BUG_ON() on tree mod log failures at insert_ptr()
        btrfs: do not BUG_ON() on tree mod log failure at insert_new_root()
        btrfs: do not BUG_ON() on tree mod log failures at push_nodes_for_insert()
        btrfs: abort transaction at update_ref_for_cow() when ref count is zero
        ...
      cc423f63
    • Linus Torvalds's avatar
      Merge tag 'zonefs-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs · e940efa9
      Linus Torvalds authored
      Pull zonefs updates from Damien Le Moal:
      
       - Modify the synchronous direct write path to use iomap instead of
         manually coding issuing zone append write BIOs (me)
      
       - Use the FMODE_CAN_ODIRECT file flag to indicate support for direct
         IO instead of using the old way with noop direct_io methods
         (Christoph)
      
      * tag 'zonefs-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
        zonefs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method
        zonefs: use iomap for synchronous direct writes
      e940efa9
    • Linus Torvalds's avatar
      Merge tag 'erofs-for-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs · 098c5dd9
      Linus Torvalds authored
      Pull erofs updates from Gao Xiang:
       "No outstanding new feature for this cycle.
      
        Most of these commits are decompression cleanups which are part of the
        ongoing development for subpage/folio compression support as well as
        xattr cleanups for the upcoming xattr bloom filter optimization [1].
      
        In addition, there are bugfixes to address some corner cases of
        compressed images due to global data de-duplication and arm64 16k
        pages.
      
        Summary:
      
         - Fix rare I/O hang on deduplicated compressed images due to loop
           hooked chains
      
         - Fix compact compression layout of 16k blocks on arm64 devices
      
         - Fix atomic context detection of async decompression
      
         - Decompression/Xattr code cleanups"
      
      Link: https://lore.kernel.org/r/20230621083209.116024-1-jefflexu@linux.alibaba.com [1]
      
      * tag 'erofs-for-6.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
        erofs: clean up zmap.c
        erofs: remove unnecessary goto
        erofs: Fix detection of atomic context
        erofs: use separate xattr parsers for listxattr/getxattr
        erofs: unify inline/shared xattr iterators for listxattr/getxattr
        erofs: make the size of read data stored in buffer_ofs
        erofs: unify xattr_iter structures
        erofs: use absolute position in xattr iterator
        erofs: fix compact 4B support for 16k block size
        erofs: convert erofs_read_metabuf() to erofs_bread() for xattr
        erofs: use poison pointer to replace the hard-coded address
        erofs: use struct lockref to replace handcrafted approach
        erofs: adapt managed inode operations into folios
        erofs: kill hooked chains to avoid loops on deduplicated compressed images
        erofs: avoid on-stack pagepool directly passed by arguments
        erofs: allocate extra bvec pages directly instead of retrying
        erofs: clean up z_erofs_pcluster_readmore()
        erofs: remove the member readahead from struct z_erofs_decompress_frontend
        erofs: fold in z_erofs_decompress()
      098c5dd9
    • Linus Torvalds's avatar
      Merge tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux · 74774e24
      Linus Torvalds authored
      Pull fsverity updates from Eric Biggers:
       "Several updates for fs/verity/:
      
         - Do all hashing with the shash API instead of with the ahash API.
      
           This simplifies the code and reduces API overhead. It should also
           make things slightly easier for XFS's upcoming support for
           fsverity. It does drop fsverity's support for off-CPU hash
           accelerators, but that support was incomplete and not known to be
           used
      
         - Update and export fsverity_get_digest() so that it's ready for
           overlayfs's upcoming support for fsverity checking of lowerdata
      
         - Improve the documentation for builtin signature support
      
         - Fix a bug in the large folio support"
      
      * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fsverity/linux:
        fsverity: improve documentation for builtin signature support
        fsverity: rework fsverity_get_digest() again
        fsverity: simplify error handling in verify_data_block()
        fsverity: don't use bio_first_page_all() in fsverity_verify_bio()
        fsverity: constify fsverity_hash_alg
        fsverity: use shash API instead of ahash API
      74774e24
    • Linus Torvalds's avatar
      Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux · 4d483ab7
      Linus Torvalds authored
      Pull fscrypt update from Eric Biggers:
       "Just one flex array conversion patch"
      
      * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/linux:
        fscrypt: Replace 1-element array with flexible array
      4d483ab7
    • Linus Torvalds's avatar
      Merge tag 'nfsd-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux · f7976a64
      Linus Torvalds authored
      Pull nfsd updates from Chuck Lever:
      
       - Clean-ups in the READ path in anticipation of MSG_SPLICE_PAGES
      
       - Better NUMA awareness when allocating pages and other objects
      
       - A number of minor clean-ups to XDR encoding
      
       - Elimination of a race when accepting a TCP socket
      
       - Numerous observability enhancements
      
      * tag 'nfsd-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (46 commits)
        nfsd: remove redundant assignments to variable len
        svcrdma: Fix stale comment
        NFSD: Distinguish per-net namespace initialization
        nfsd: move init of percpu reply_cache_stats counters back to nfsd_init_net
        SUNRPC: Address RCU warning in net/sunrpc/svc.c
        SUNRPC: Use sysfs_emit in place of strlcpy/sprintf
        SUNRPC: Remove transport class dprintk call sites
        SUNRPC: Fix comments for transport class registration
        svcrdma: Remove an unused argument from __svc_rdma_put_rw_ctxt()
        svcrdma: trace cc_release calls
        svcrdma: Convert "might sleep" comment into a code annotation
        NFSD: Add an nfsd4_encode_nfstime4() helper
        SUNRPC: Move initialization of rq_stime
        SUNRPC: Optimize page release in svc_rdma_sendto()
        svcrdma: Prevent page release when nothing was received
        svcrdma: Revert 2a1e4f21 ("svcrdma: Normalize Send page handling")
        SUNRPC: Revert 57990067 ("svcrdma: Remove unused sc_pages field")
        SUNRPC: Revert cc93ce95 ("svcrdma: Retain the page backing rq_res.head[0].iov_base")
        NFSD: add encoding of op_recall flag for write delegation
        NFSD: Add "official" reviewers for this subsystem
        ...
      f7976a64
    • Linus Torvalds's avatar
      Merge tag 'v6.5/vfs.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · c0a572d9
      Linus Torvalds authored
      Pull vfs mount updates from Christian Brauner:
       "This contains the work to extend move_mount() to allow adding a mount
        beneath the topmost mount of a mount stack.
      
        There are two LWN articles about this. One covers the original patch
        series in [1]. The other in [2] summarizes the session and roughly the
        discussion between Al and me at LSFMM. The second article also goes
        into some good questions from attendees.
      
        Since all details are found in the relevant commit with a technical
        dive into semantics and locking at the end I'm only adding the
        motivation and core functionality for this from commit message and
        leave out the invasive details. The code is also heavily commented and
        annotated as well which was explicitly requested.
      
        TL;DR:
      
          > mount -t ext4 /dev/sda /mnt
            |
            └─/mnt    /dev/sda    ext4
      
          > mount --beneath -t xfs /dev/sdb /mnt
            |
            └─/mnt    /dev/sdb    xfs
              └─/mnt  /dev/sda    ext4
      
          > umount /mnt
            |
            └─/mnt    /dev/sdb    xfs
      
        The longer motivation is that various distributions are adding or are
        in the process of adding support for system extensions and in the
        future configuration extensions through various tools. A more detailed
        explanation on system and configuration extensions can be found on the
        manpage which is listed below at [3].
      
        System extension images may, dynamically at runtime, extend the
        /usr/ and /opt/ directory hierarchies with additional files. This is
        particularly useful on immutable system images where a /usr/ and/or
        /opt/ hierarchy residing on a read-only file system shall be extended
        temporarily at runtime without making any persistent modifications.
      
        When one or more system extension images are activated, their /usr/
        and /opt/ hierarchies are combined via overlayfs with the same
        hierarchies of the host OS, and the host /usr/ and /opt/ overmounted
        with it ("merging"). When they are deactivated, the mount point is
        disassembled — again revealing the unmodified original host version of
        the hierarchy ("unmerging"). Merging thus makes the extension's
        resources suddenly appear below the /usr/ and /opt/ hierarchies as if
        they were included in the base OS image itself. Unmerging makes them
        disappear again, leaving in place only the files that were shipped
        with the base OS image itself.
      
        System configuration images are similar but operate on directories
        containing system or service configuration.
      
        On nearly all modern distributions mount propagation plays a crucial
        role and the rootfs of the OS is a shared mount in a peer group
        (usually with peer group id 1):
      
           TARGET  SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
           /       /       ext4    shared:1     29      1
      
        On such systems all services and containers run in a separate mount
        namespace and are pivot_root()ed into their rootfs. A separate mount
        namespace is almost always used as it is the minimal isolation
        mechanism services have. But usually they are even much more isolated
        up to the point where they almost become indistinguishable from
        containers.
      
        Mount propagation again plays a crucial role here. The rootfs of all
        these services is a slave mount to the peer group of the host rootfs.
        This is done so the service will receive mount propagation events from
        the host when certain files or directories are updated.
      
        In addition, the rootfs of each service, container, and sandbox is
        also a shared mount in its separate peer group:
      
           TARGET  SOURCE  FSTYPE  PROPAGATION         MNT_ID  PARENT_ID
           /       /       ext4    shared:24 master:1  71      47
      
        For people not too familiar with mount propagation, the master:1 means
        that this is a slave mount to peer group 1. Which as one can see is
        the host rootfs as indicated by shared:1 above. The shared:24
        indicates that the service rootfs is a shared mount in a separate peer
        group with peer group id 24.
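
The PROPAGATION column described above can be decoded mechanically. The following is a minimal sketch, assuming the column uses the space-separated `shared:N` / `master:N` tokens shown in the tables of this message; the function names `parse_propagation` and `describe` are hypothetical helpers, not part of any existing tool.

```python
def parse_propagation(field: str) -> dict:
    """Parse a PROPAGATION column such as 'shared:24 master:1'.

    Returns the peer group ids found, or None where a token is absent.
    Tokens without a numeric id (for example 'private') are ignored.
    """
    result = {"shared": None, "master": None}
    for token in field.split():
        kind, sep, group = token.partition(":")
        if sep and kind in result and group.isdigit():
            result[kind] = int(group)
    return result


def describe(field: str) -> str:
    """Turn a PROPAGATION column into the wording used in the text."""
    info = parse_propagation(field)
    parts = []
    if info["shared"] is not None:
        parts.append(f"shared mount in peer group {info['shared']}")
    if info["master"] is not None:
        parts.append(f"slave to peer group {info['master']}")
    return ", ".join(parts) or "private mount"
```

For the service rootfs above, `describe("shared:24 master:1")` reports both relationships: the mount is shared in peer group 24 and a slave to peer group 1.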
      
        A service may run other services. Such nested services will also have
        a rootfs mount that is a slave to the peer group of the outer service
        rootfs mount.
      
        For containers things are just slightly different. A container's rootfs
        isn't a slave to the service's or host rootfs' peer group. The rootfs
        mount of a container is simply a shared mount in its own peer group:
      
           TARGET                    SOURCE  FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
           /home/ubuntu/debian-tree  /       ext4    shared:99    61      60
      
        So whereas services are isolated OS components a container is treated
        like a separate world and mount propagation into it is restricted to a
        single well known mount that is a slave to the peer group of the
        shared mount /run on the host:
      
           TARGET                  SOURCE              FSTYPE  PROPAGATION  MNT_ID  PARENT_ID
           /propagate/debian-tree  /run/host/incoming  tmpfs   master:5     71      68
      
        Here, the master:5 indicates that this mount is a slave to the peer
        group with peer group id 5. This allows mounts to be propagated into the
        container and served as a workaround for not being able to insert
        mounts into mount namespaces directly. But the new mount api does
        support inserting mounts directly. For the interested reader the
        blogpost in [4] might be worth reading where I explain the old and the
        new approach to inserting mounts into mount namespaces.
      
        Containers, of course, can themselves be run as services. They often
        run full systems themselves which means they again run services and
        containers with the exact same propagation settings explained above.
      
        The whole system is designed so that it can be easily updated,
        including all services in various fine-grained ways without having to
        enter every single service's mount namespace which would be
        prohibitively expensive. The mount propagation layout has been
        carefully chosen so it is possible to propagate updates for system
        extensions and configurations from the host into all services.
      
        The simplest model to update the whole system is to mount on top of
        /usr, /opt, or /etc on the host. The new mount on /usr, /opt, or /etc
        will then propagate into every service. This works cleanly the first
        time. However, when the system is updated multiple times it becomes
        necessary to unmount the first update on /opt, /usr, /etc and then
        propagate the new update. But this means, there's an interval where
        the old base system is accessible. This has to be avoided to protect
        against downgrade attacks.
      
        The vfs already exposes a mechanism to userspace whereby mounts can be
        mounted beneath an existing mount. Such mounts are internally referred
        to as "tucked". The patch series exposes the ability to mount beneath
        a top mount through the new MOVE_MOUNT_BENEATH flag for the
        move_mount() system call. This allows userspace to seamlessly upgrade
        mounts. After this series the only thing that will have changed is
        that mounting beneath an existing mount can be done explicitly instead
        of just implicitly.
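
The difference between the two upgrade strategies can be illustrated with a toy model. This is only a sketch: it treats a mount point as a plain Python list whose last element is the visible mount, and it does not call move_mount() or any other system call. The class `MountPoint` and both `upgrade_*` functions are hypothetical names introduced for illustration.

```python
class MountPoint:
    """Toy model of a mount stack at one path; the last entry is visible."""

    def __init__(self, base):
        self.stack = [base]

    def visible(self):
        return self.stack[-1]

    def mount(self, fs):
        """Regular mount: the new filesystem goes on top of the stack."""
        self.stack.append(fs)

    def mount_beneath(self, fs):
        """Model of mounting beneath: insert directly under the top mount."""
        self.stack.insert(len(self.stack) - 1, fs)

    def umount(self):
        """Remove the top mount and return what is visible afterwards."""
        self.stack.pop()
        return self.visible()


def upgrade_by_remount(mp, new):
    """Unmount the old update, then mount the new one on top."""
    seen = [mp.umount()]        # the base system is exposed here
    mp.mount(new)
    seen.append(mp.visible())
    return seen


def upgrade_beneath(mp, new):
    """Tuck the new update under the old one, then drop the old one."""
    mp.mount_beneath(new)
    seen = [mp.visible()]       # still the old update
    seen.append(mp.umount())    # the new update is revealed directly
    return seen
```

Starting from a stack of `base` plus `update-1`, the first strategy passes through a state where `base` is visible, while the second goes from `update-1` straight to `update-2`.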
      
        The crux is that the proposed mechanism already exists and that it is
        so powerful as to cover cases where mounts are supposed to be updated
        with new versions. Crucially, it offers an important flexibility.
        Namely that updates to a system may either be forced or can be delayed
        and the umount of the top mount be left to a service if it is a
        cooperative one"
      
      Link: https://lwn.net/Articles/927491 [1]
      Link: https://lwn.net/Articles/934094 [2]
      Link: https://man7.org/linux/man-pages/man8/systemd-sysext.8.html [3]
      Link: https://brauner.io/2023/02/28/mounting-into-mount-namespaces.html [4]
      Link: https://github.com/flatcar/sysext-bakery
      Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_1
      Link: https://fedoraproject.org/wiki/Changes/Unified_Kernel_Support_Phase_2
      Link: https://github.com/systemd/systemd/pull/26013
      
      * tag 'v6.5/vfs.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
        fs: allow to mount beneath top mount
        fs: use a for loop when locking a mount
        fs: properly document __lookup_mnt()
        fs: add path_mounted()
      c0a572d9
    • Linus Torvalds's avatar
      Merge tag 'v6.5/vfs.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · 1f2300a7
      Linus Torvalds authored
      Pull vfs file handling updates from Christian Brauner:
       "This contains Amir's work to fix a long-standing problem where an
        unprivileged overlayfs mount can be used to avoid fanotify permission
        events that were requested for an inode or superblock on the
        underlying filesystem.
      
        Some background about files opened in overlayfs. If a file is opened
        in overlayfs @file->f_path will refer to a "fake" path. What this
        means is that while @file->f_inode will refer to inode of the
        underlying layer, @file->f_path refers to an overlayfs
        {dentry,vfsmount} pair. The reasons for doing this are out of scope
        here but it is the reason why the vfs has been providing the
        open_with_fake_path() helper for overlayfs for a very long time now. So
        nothing new here.
      
        This is for sure not very elegant and everyone including the overlayfs
        maintainers agrees. Improving this significantly would involve more
        fragile and potentially rather invasive changes.
      
        In various codepaths access to the path of the underlying filesystem
        is needed for such a hybrid file. The best example is fsnotify where
        this becomes security relevant. Passing the overlayfs
        @file->f_path->dentry will cause fsnotify to skip generating fsnotify
        events registered on the underlying inode or superblock.
      
        To fix this we extend the vfs provided open_with_fake_path() concept
        for overlayfs to create a backing file container that holds the real
        path and to expose a helper that can be used by relevant callers to
        get access to the path of the underlying filesystem through the new
        file_real_path() helper. This pattern is similar to what we do in
        d_real() and d_real_inode().
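
The idea of a container that carries the real path next to the "fake" one can be sketched as a toy model. This is not kernel code: the `File` dataclass, its fields, and the bit value of `FMODE_BACKING` are assumptions made for illustration, and only the names file_real_path() and FMODE_BACKING are taken from this message.

```python
from dataclasses import dataclass
from typing import Optional

FMODE_BACKING = 0x1  # illustrative bit value, not the kernel's


@dataclass
class File:
    f_path: str                       # path the opener sees (may be "fake")
    f_mode: int = 0
    real_path: Optional[str] = None   # only meaningful for backing files


def file_real_path(f: File) -> str:
    """Return the underlying path for backing files, else f_path itself."""
    if f.f_mode & FMODE_BACKING and f.real_path is not None:
        return f.real_path
    return f.f_path
```

In this model a caller such as an event generator would ask `file_real_path(f)` rather than reading `f.f_path`, so that a file opened through the overlay resolves to the path on the underlying filesystem while ordinary files are unaffected.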
      
        The first beneficiary is fsnotify and fixes the security sensitive
        problem mentioned above.
      
        There's a couple of nice cleanups included as well.
      
        Over time, the old open_with_fake_path() helper added specifically for
        overlayfs a long time ago started to get used in other places such as
        cachefiles, even though cachefiles has nothing to do with hybrid
        files.
      
        The only reason cachefiles used that concept was that files opened
        with open_with_fake_path() aren't charged against the caller's open
        file limit by raising FMODE_NOACCOUNT. It's just mere coincidence that
        both overlayfs and cachefiles need to ensure to not overcharge the
        caller for their internal open calls.
      
        So this work disentangles FMODE_NOACCOUNT use cases and backing file
        use-cases by adding the FMODE_BACKING flag which indicates that the
        file can be used to retrieve the backing file of another filesystem.
        (Fyi, Jens will be sending you a really nice cleanup from Christoph
        that gets rid of 3 FMODE_* flags otherwise this would be the last
        fmode_t bit we'd be using.)
      
        So now overlayfs becomes the sole user of the renamed
        open_with_fake_path() helper which is now named backing_file_open().
        For internal kernel users such as cachefiles that are only interested
        in FMODE_NOACCOUNT but not in FMODE_BACKING we add a new
        kernel_file_open() helper which opens a file without being charged
        against the caller's open file limit. All new helpers are properly
        documented and clearly annotated to mention their special uses.
      
        We also rename vfs_tmpfile_open() to kernel_tmpfile_open() to clearly
        distinguish it from vfs_tmpfile() and align it with the other kernel_*()
        internal helpers"
      
      * tag 'v6.5/vfs.file' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
        ovl: enable fsnotify events on underlying real files
        fs: use backing_file container for internal files with "fake" f_path
        fs: move kmem_cache_zalloc() into alloc_empty_file*() helpers
        fs: use a helper for opening kernel internal files
        fs: rename {vfs,kernel}_tmpfile_open()
      1f2300a7
    • Linus Torvalds's avatar
      Merge tag 'v6.5/vfs.rename.locking' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · 2eedfa9e
      Linus Torvalds authored
      Pull vfs rename locking updates from Christian Brauner:
       "This contains the work from Jan to fix problems with cross-directory
        renames originally reported in [1].
      
        To quickly sum it up, some filesystems (so far we know at least about
        ext4, udf, f2fs, ocfs2, likely also reiserfs, gfs2 and others) need to
        lock the directory when it is being renamed into another directory.
      
        This is because we need to update the parent pointer in the directory
        in that case and if that races with other operations on the directory,
        in particular a conversion from one directory format into another, bad
        things can happen.
      
        So far we've done the locking in the filesystem code but recently
        Darrick pointed out in [2] that the RENAME_EXCHANGE case was missing.
        That one is particularly nasty because RENAME_EXCHANGE can arbitrarily
        mix regular files and directories and proper lock ordering is not
        achievable in the filesystems alone.
      
        This patch set adds locking into vfs_rename() so that not only parent
        directories but also moved inodes, regardless of whether they are
        directories or not, are locked when calling into the filesystem.
      
        This means establishing a locking order for unrelated directories. New
        helpers are added for this purpose and our documentation is updated to
        cover this in detail.
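
Why a fixed locking order matters can be shown with a small user-space sketch. The ordering rule the kernel actually uses is not stated in this message; sorting by inode number below is an assumption chosen only to make the example deterministic, and `Inode`, `lock_order`, and `lock_two` are hypothetical names.

```python
import threading
from contextlib import contextmanager


class Inode:
    def __init__(self, ino):
        self.ino = ino
        self.lock = threading.Lock()


def lock_order(a, b):
    """Return the distinct inodes in the order they must be locked."""
    if a is b:
        return [a]
    # Assumed ordering key: inode number. Any total order works as long
    # as every caller uses the same one.
    return sorted((a, b), key=lambda inode: inode.ino)


@contextmanager
def lock_two(a, b):
    """Lock two inodes in a global order, release in reverse order."""
    ordered = lock_order(a, b)
    for inode in ordered:
        inode.lock.acquire()
    try:
        yield ordered
    finally:
        for inode in reversed(ordered):
            inode.lock.release()
```

Because `lock_two(x, y)` and `lock_two(y, x)` acquire the locks in the same sequence, two concurrent callers cannot each hold one lock while waiting for the other.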
      
        The locking is now actually easier to follow as we now always lock
        source and target. We've always locked the target independent of
        whether it was a directory or file and we've always locked source if
        it was a regular file. The exact details for why this came about can
        be found in [3] and [4]"
      
      Link: https://lore.kernel.org/all/20230117123735.un7wbamlbdihninm@quack3 [1]
      Link: https://lore.kernel.org/all/20230517045836.GA11594@frogsfrogsfrogs [2]
      Link: https://lore.kernel.org/all/20230526-schrebergarten-vortag-9cd89694517e@brauner [3]
      Link: https://lore.kernel.org/all/20230530-seenotrettung-allrad-44f4b00139d4@brauner [4]
      
      * tag 'v6.5/vfs.rename.locking' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
        fs: Restrict lock_two_nondirectories() to non-directory inodes
        fs: Lock moved directories
        fs: Establish locking order for unrelated directories
        Revert "f2fs: fix potential corruption when moving a directory"
        Revert "udf: Protect rename against modification of moved directory"
        ext4: Remove ext4 locking of moved directory
      2eedfa9e