  1. Jun 10, 2023
  2. May 28, 2023
  3. May 27, 2023
    • Merge tag 'cxl-fixes-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl · 49572d53
      Linus Torvalds authored
      Pull compute express link fixes from Dan Williams:
       "The 'media ready' series prevents the driver from acting on bad
        capacity information, and it moves some checks earlier in the init
        sequence which impacts topics in the queue for 6.5.
      
        Additional hotplug testing uncovered a missing enable for memory
        decode. A debug crash fix is also included.
      
        Summary:
      
         - Stop trusting capacity data before the "media ready" indication
      
         - Add missing HDM decoder capability enable for the cold-plug case
      
         - Fix a debug message induced crash"
      
      * tag 'cxl-fixes-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
        cxl: Explicitly initialize resources when media is not ready
        cxl/port: Fix NULL pointer access in devm_cxl_add_port()
        cxl: Move cxl_await_media_ready() to before capacity info retrieval
        cxl: Wait Memory_Info_Valid before access memory related info
        cxl/port: Enable the HDM decoder capability for switch ports
    • Merge tag 'arm-fixes-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc · 18713e8a
      Linus Torvalds authored
      Pull ARM SoC fixes from Arnd Bergmann:
       "There have not been a lot of fixes for the soc tree in 6.4, but
        these have been sitting here for too long.
      
        For the devicetree side, there is one minor warning fix for vexpress,
        the rest are all for the NXP i.MX platforms: SoC specific bugfixes
        for the iMX8 clocks and its USB-3.0 gadget device, as well as board
        specific fixes for regulators and the phy on some of the i.MX boards.
      
        The microchip risc-v and arm32 maintainers now also add a shared
        maintainer file entry for the arm64 parts.
      
        The remaining fixes are all for firmware drivers, addressing mistakes
        in the optee, scmi and ff-a firmware driver implementation, mostly in
        the error handling code, incorrect use of the alloc_workqueue()
        interface in SCMI, and compatibility with corner cases of the firmware
        implementation"
      
      * tag 'arm-fixes-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
        MAINTAINERS: update arm64 Microchip entries
        arm64: dts: imx8: fix USB 3.0 Gadget Failure in QM & QXPB0 at super speed
        dt-binding: cdns,usb3: Fix cdns,on-chip-buff-size type
        arm64: dts: colibri-imx8x: delete adc1 and dsp
        arm64: dts: colibri-imx8x: fix iris pinctrl configuration
        arm64: dts: colibri-imx8x: move pinctrl property from SoM to eval board
        arm64: dts: colibri-imx8x: fix eval board pin configuration
        arm64: dts: imx8mp: Fix video clock parents
        ARM: dts: imx6qdl-mba6: Add missing pvcie-supply regulator
        ARM: dts: imx6ull-dhcor: Set and limit the mode for PMIC buck 1, 2 and 3
        arm64: dts: imx8mn-var-som: fix PHY detection bug by adding deassert delay
        arm64: dts: imx8mn: Fix video clock parents
        firmware: arm_ffa: Set reserved/MBZ fields to zero in the memory descriptors
        firmware: arm_ffa: Fix FFA device names for logical partitions
        firmware: arm_ffa: Fix usage of partition info get count flag
        firmware: arm_ffa: Check if ffa_driver remove is present before executing
        arm64: dts: arm: add missing cache properties
        ARM: dts: vexpress: add missing cache properties
        firmware: arm_scmi: Fix incorrect alloc_workqueue() invocation
        optee: fix uninited async notif value
    • Merge tag 'pci-v6.4-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci · 96f15fc6
      Linus Torvalds authored
      Pull PCI fix from Bjorn Helgaas:
      
       - Quirk Ice Lake Root Ports to work around DPC log size issue (Mika
         Westerberg)
      
      * tag 'pci-v6.4-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
        PCI/DPC: Quirk PIO log size for Intel Ice Lake Root Ports
    • Merge tag 'vfio-v6.4-rc4' of https://github.com/awilliam/linux-vfio · 8846af75
      Linus Torvalds authored
      Pull VFIO fix from Alex Williamson:
      
       - Test for and return error for invalid pfns through the pin pages
         interface (Yan Zhao)
      
      * tag 'vfio-v6.4-rc4' of https://github.com/awilliam/linux-vfio:
        vfio/type1: check pfn valid before converting to struct page
    • Merge tag 'block-6.4-2023-05-26' of git://git.kernel.dk/linux · a92c9ab6
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "A few fixes for the storage side of things:
      
         - Fix bio caching condition for passthrough IO (Anuj)
      
         - end-of-device check fix for zero sized devices (Christoph)
      
         - Update Paolo's email address
      
         - NVMe pull request via Keith with a single quirk addition
      
         - Fix regression in how wbt enablement is done (Yu)
      
         - Fix race in active queue accounting (Tian)"
      
      * tag 'block-6.4-2023-05-26' of git://git.kernel.dk/linux:
        NVMe: Add MAXIO 1602 to bogus nid list.
        block: make bio_check_eod work for zero sized devices
        block: fix bio-cache for passthru IO
        block, bfq: update Paolo's address in maintainer list
        blk-mq: fix race condition in active queue accounting
        blk-wbt: fix that wbt can't be disabled by default
    • Merge tag 'io_uring-6.4-2023-05-26' of git://git.kernel.dk/linux · 6fae9129
      Linus Torvalds authored
      Pull io_uring fix from Jens Axboe:
       "Just a single fix for the conditional schedule with the SQPOLL thread,
        dropping the uring_lock if we do need to reschedule"
      
      * tag 'io_uring-6.4-2023-05-26' of git://git.kernel.dk/linux:
        io_uring: unlock sqd->lock before sq thread release CPU
    • Merge tag 'thermal-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 77af1f2b
      Linus Torvalds authored
      Pull thermal control fix from Rafael Wysocki:
       "Fix a regression introduced inadvertently during the 6.3 cycle by a
        commit making the Intel int340x thermal driver use sysfs_emit_at()
        instead of scnprintf() (Srinivas Pandruvada)"
      
      * tag 'thermal-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        thermal: intel: int340x: Add new line for UUID display
    • Merge tag 'pm-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · c551afcd
      Linus Torvalds authored
      Pull power management fixes from Rafael Wysocki:
       "Fix three issues related to the ->fast_switch callback in the AMD
        P-state cpufreq driver (Gautham R. Shenoy and Wyes Karny)"
      
      * tag 'pm-6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        cpufreq: amd-pstate: Update policy->cur in amd_pstate_adjust_perf()
        cpufreq: amd-pstate: Remove fast_switch_possible flag from active driver
        cpufreq: amd-pstate: Add ->fast_switch() callback
    • cxl: Explicitly initialize resources when media is not ready · 793a539a
      Dave Jiang authored
      
      
      When media is not ready do not assume that the capacity information from
      the identify command is valid, i.e. ->total_bytes
      ->partition_align_bytes ->{volatile,persistent}_only_bytes. Explicitly
      zero out the capacity resources and exit early.
      
      Given zero-init of those fields this patch is functionally equivalent to
      the prior state, but it improves readability and robustness going
      forward.
      
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Link: https://lore.kernel.org/r/168506118166.3004974.13523455340007852589.stgit@djiang5-mobl3
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • Merge tag 'gpio-fixes-for-v6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux · 91a30434
      Linus Torvalds authored
      Pull gpio fixes from Bartosz Golaszewski:
      
       - fix incorrect output in in-tree gpio tools
      
       - fix a shell coding issue in gpio-sim selftests
      
       - correctly set the permissions for debugfs attributes exposed by
         gpio-mockup
      
       - fix chip name and pin count in gpio-f7188x for one of the supported
         models
      
       - fix numberspace pollution when using dynamically and statically
         allocated GPIOs together
      
      * tag 'gpio-fixes-for-v6.4-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
        gpio-f7188x: fix chip name and pin count on Nuvoton chip
        gpiolib: fix allocation of mixed dynamic/static GPIOs
        gpio: mockup: Fix mode of debugfs files
        selftests: gpio: gpio-sim: Fix BUG: test FAILED due to recent change
        tools: gpio: fix debounce_period_us output of lsgpio
    • Merge tag 'for-6.4-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · b158dd94
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
      
       - handle memory allocation error in checksumming helper (reported by
         syzbot)
      
       - fix lockdep splat when aborting a transaction, add NOFS protection
         around invalidate_inode_pages2 that could allocate with GFP_KERNEL
      
       - reduce chances to hit an ENOSPC during scrub with RAID56 profiles
      
      * tag 'for-6.4-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: use nofs when cleaning up aborted transactions
        btrfs: handle memory allocation failure in btrfs_csum_one_bio
        btrfs: scrub: try harder to mark RAID56 block groups read-only
    • Merge tag 'drm-fixes-2023-05-26' of git://anongit.freedesktop.org/drm/drm · b83ac44e
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "This week's collection is pretty spread out, accel/qaic has a bunch of
        fixes, amdgpu, then lots of single fixes across a bunch of places.
      
        core:
         - fix drmm_mutex_init lock class
      
        mgag200:
         - fix gamma lut initialisation
      
        pl111:
         - fix FB depth on IMPD-1 framebuffer
      
        amdgpu:
         - Fix missing BO unlocking in KIQ error path
         - Avoid spurious secure display error messages
         - SMU13 fix
         - Fix an OD regression
         - GPU reset display IRQ warning fix
         - MST fix
      
        radeon:
         - Fix a DP regression
      
        i915:
         - PIPEDMC disabling fix for bigjoiner config
      
        panel:
         - fix aya neo air plus quirk
      
        sched:
         - remove redundant NULL check
      
        qaic:
         - fix NNC message corruption
         - Grab ch_lock during QAIC_ATTACH_SLICE_BO
         - Flush the transfer list again
         - Validate if BO is sliced before slicing
         - Validate user data before grabbing any lock
         - initialize ret variable to 0
         - silence some uninitialized variable warnings"
      
      * tag 'drm-fixes-2023-05-26' of git://anongit.freedesktop.org/drm/drm:
        drm/amd/display: Have Payload Properly Created After Resume
        drm/amd/display: Fix warning in disabling vblank irq
        drm/amd/pm: Fix output of pp_od_clk_voltage
        drm/amd/pm: add missing NotifyPowerSource message mapping for SMU13.0.7
        drm/radeon: reintroduce radeon_dp_work_func content
        drm/amdgpu: don't enable secure display on incompatible platforms
        drm:amd:amdgpu: Fix missing buffer object unlock in failure path
        accel/qaic: Fix NNC message corruption
        accel/qaic: Grab ch_lock during QAIC_ATTACH_SLICE_BO
        accel/qaic: Flush the transfer list again
        accel/qaic: Validate if BO is sliced before slicing
        accel/qaic: Validate user data before grabbing any lock
        accel/qaic: initialize ret variable to 0
        drm/i915: Fix PIPEDMC disabling for a bigjoiner configuration
        drm: fix drmm_mutex_init()
        drm/sched: Remove redundant check
        drm: panel-orientation-quirks: Change Air's quirk to support Air Plus
        accel/qaic: silence some uninitialized variable warnings
        drm/pl111: Fix FB depth on IMPD-1 framebuffer
        drm/mgag200: Fix gamma lut not initialized.
    • x86: re-introduce support for ERMS copies for user space accesses · 47ee3f1d
      Linus Torvalds authored
      I tried to streamline our user memory copy code fairly aggressively in
      commit adfcf423 ("x86: don't use REP_GOOD or ERMS for user memory
      copies"), in order to then be able to clean up the code and inline the
      modern FSRM case in commit 577e6a7f ("x86: inline the 'rep movs' in
      user copies for the FSRM case").
      
      We had reports [1] of that causing regressions earlier with blogbench,
      but that turned out to be a horrible benchmark for that case, and not a
      sufficient reason for re-instating "rep movsb" on older machines.
      
      However, now Eric Dumazet reported [2] a regression in performance that
      seems to be a rather more real benchmark, where due to the removal of
      "rep movs" a TCP stream over a 100Gbps network no longer reaches line
      speed.
      
      And it turns out that with the simplified calling convention for the
      non-FSRM case in commit 427fda2c ("x86: improve on the non-rep
      'copy_user' function"), re-introducing the ERMS case is actually fairly
      simple.
      
      Of course, that "fairly simple" is glossing over several missteps due to
      having to fight our assembler alternative code.  This code really wanted
      to rewrite a conditional branch to have two different targets, but that
      made objtool sufficiently unhappy that this instead just ended up doing
      a choice between "jump to the unrolled loop, or use 'rep movsb'
      directly".
      
      Let's see if somebody finds a case where the kernel memory copies also
      care (see commit 68674f94: "x86: don't use REP_GOOD or ERMS for
      small memory copies").  But Eric does argue that the user copies are
      special because networking tries to copy up to 32KB at a time, if
      order-3 page allocations are possible.
      
      In-kernel memory copies are typically small, unless they are the special
      "copy pages at a time" kind that still use "rep movs".
      
      Link: https://lore.kernel.org/lkml/202305041446.71d46724-yujie.liu@intel.com/ [1]
      Link: https://lore.kernel.org/lkml/CANn89iKUbyrJ=r2+_kK+sb2ZSSHifFZ7QkPLDpAtkJ8v4WUumA@mail.gmail.com/ [2]
      Reported-and-tested-by: Eric Dumazet <edumazet@google.com>
      Fixes: adfcf423 ("x86: don't use REP_GOOD or ERMS for user memory copies")
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. May 26, 2023
    • Merge tag 'nvme-6.4-2023-05-26' of git://git.infradead.org/nvme into block-6.4 · 9491d01f
      Jens Axboe authored
      Pull NVMe fix from Keith:
      
      "nvme fixes for 6.4
      
       One nvme quirk (Tatsuki)"
      
      * tag 'nvme-6.4-2023-05-26' of git://git.infradead.org/nvme:
        NVMe: Add MAXIO 1602 to bogus nid list.
    • NVMe: Add MAXIO 1602 to bogus nid list. · a3a9d63d
      Tatsuki Sugiura authored
      
      
      HIKSEMI FUTURE M.2 SSD uses the same dummy nguid and eui64.
      I confirmed it with my two devices.
      
      This patch marks the controller as NVME_QUIRK_BOGUS_NID.
      
      ---------------------------------------------------------
      sugi@tempest:~% sudo nvme id-ctrl /dev/nvme0
      NVME Identify Controller:
      vid       : 0x1e4b
      ssvid     : 0x1e4b
      sn        : 30096022612
      mn        : HS-SSD-FUTURE 2048G
      fr        : SN10542
      rab       : 0
      ieee      : 000000
      cmic      : 0
      mdts      : 7
      cntlid    : 0
      ver       : 0x10400
      rtd3r     : 0x7a120
      rtd3e     : 0x1e8480
      oaes      : 0x200
      ctratt    : 0x2
      rrls      : 0
      cntrltype : 1
      fguid     : 00000000-0000-0000-0000-000000000000
      <snip...>
      ---------------------------------------------------------
      
      ---------------------------------------------------------
      sugi@tempest:~% sudo nvme id-ns /dev/nvme0n1
      NVME Identify Namespace 1:
      <snip...>
      nguid   : 00000000000000000000000000000000
      eui64   : 0000000000000002
      lbaf  0 : ms:0   lbads:9  rp:0 (in use)
      ---------------------------------------------------------
      
      Signed-off-by: Tatsuki Sugiura <sugi@nemui.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
    • Merge tag 'ffa-fixes-6.4' of... · abf5422e
      Arnd Bergmann authored
      Merge tag 'ffa-fixes-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into arm/fixes
      
      Arm FF-A fixes for v6.4
      
      Quite a few fixes to address a set of assorted issues:
      1. NULL pointer dereference if the ffa driver doesn't provide a remove()
         callback, as it is currently executed unconditionally
      2. FF-A core probe failure on systems with v1.0 firmware, as the new
         partition info get count flag is used unconditionally
      3. Failure to register more than one logical partition or service within
         the same physical partition, as the device name contains only the VM
         ID, which is the same for all of them even though each has a unique
         UUID
      4. Rejection of certain memory interface transmissions by the receivers
         (secure partitions), as a few MBZ fields are non-zero due to lack of
         explicit re-initialization of those fields
      
      * tag 'ffa-fixes-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
        firmware: arm_ffa: Set reserved/MBZ fields to zero in the memory descriptors
        firmware: arm_ffa: Fix FFA device names for logical partitions
        firmware: arm_ffa: Fix usage of partition info get count flag
        firmware: arm_ffa: Check if ffa_driver remove is present before executing
      
      Link: https://lore.kernel.org/r/20230509143453.1188753-1-sudeep.holla@arm.com
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    • Merge tag 'drm-misc-fixes-2023-05-24' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes · 5502d1fa
      Dave Airlie authored
      
      
      drm-misc-fixes for v6.4-rc4:
      - A few non-trivial fixes to qaic.
      - Fix drmm_mutex_init always using same lock class.
      - Fix pl111 fb depth.
      - Fix uninitialised gamma lut in mgag200.
      - Add Aya Neo Air Plus quirk.
      - Trivial null check removal in scheduler.
      
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/d19f748c-2c5b-8140-5b05-a8282dfef73e@linux.intel.com
    • Merge tag 'amd-drm-fixes-6.4-2023-05-24' of... · 13aa38f8
      Dave Airlie authored
      Merge tag 'amd-drm-fixes-6.4-2023-05-24' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes
      
      amd-drm-fixes-6.4-2023-05-24:
      
      amdgpu:
      - Fix missing BO unlocking in KIQ error path
      - Avoid spurious secure display error messages
      - SMU13 fix
      - Fix an OD regression
      - GPU reset display IRQ warning fix
      - MST fix
      
      radeon:
      - Fix a DP regression
      
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      From: Alex Deucher <alexander.deucher@amd.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/20230524211238.7749-1-alexander.deucher@amd.com
    • Merge tag 'drm-intel-fixes-2023-05-25' of... · 94d39d01
      Dave Airlie authored
      Merge tag 'drm-intel-fixes-2023-05-25' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
      
      PIPEDMC disabling fix for bigjoiner config
      
      Signed-off-by: Dave Airlie <airlied@redhat.com>
      From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Link: https://patchwork.freedesktop.org/patch/msgid/ZG9aROGyc947/J1l@jlahtine-mobl.ger.corp.intel.com
    • Merge tag '6.4-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6 · 0d85b27b
      Linus Torvalds authored
      Pull smb directory moves and client fixes from Steve French:
       "Four smb3 client fixes (three of which marked for stable) and three
        patches to move fs/cifs and fs/ksmbd to a new common "fs/smb"
        parent directory
      
         - Move the client and server source directories to a common parent
           directory:
      
             fs/cifs -> fs/smb/client
             fs/ksmbd -> fs/smb/server
             fs/smbfs_common -> fs/smb/common
      
         - important readahead fix
      
         - important fix for SMB1 regression
      
         - fix for missing mount option ("mapchars") in mount API conversion
      
         - minor debugging improvement"
      
      * tag '6.4-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
        smb3: move Documentation/filesystems/cifs to Documentation/filesystems/smb
        cifs: correct references in Documentation to old fs/cifs path
        smb: move client and server files to common directory fs/smb
        cifs: mapchars mount option ignored
        smb3: display debug information better for encryption
        cifs: fix smb1 mount regression
        cifs: Fix cifs_limit_bvec_subset() to correctly check the maxmimum size
    • Merge tag 'parisc-for-6.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux · 192fe71c
      Linus Torvalds authored
      Pull parisc architecture fixes from Helge Deller:
       "Quite a bunch of real bugfixes in here and most of them are tagged for
        backporting: A fix for cache flushing from irq context, a kprobes &
        kgdb breakpoint handling fix, and a fix in the alternative code
        patching function to take care of CPU hotplugging.
      
        parisc now provides LOCKDEP support and comes with a lightweight
        spinlock check. Both features helped me to find the cache flush bug.
      
        Additionally writing the AGP gatt has been fixed, the machine allows
        the user to reboot after a system halt and arch_sync_dma_for_cpu() has
        been optimized for PCXL CPUs.
      
        Summary:
      
         - Fix flush_dcache_page() for usage from irq context
      
         - Handle kprobes breakpoints only in kernel context
      
         - Handle kgdb breakpoints only in kernel context
      
         - Use num_present_cpus() in alternative patching code
      
         - Enable LOCKDEP support
      
         - Add lightweight spinlock checks
      
         - Flush AGP gatt writes and adjust gatt mask in parisc_agp_mask_memory()
      
         - Allow to reboot machine after system halt
      
         - Improve cache flushing for PCXL in arch_sync_dma_for_cpu()"
      
      * tag 'parisc-for-6.4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
        parisc: Fix flush_dcache_page() for usage from irq context
        parisc: Handle kgdb breakpoints only in kernel context
        parisc: Handle kprobes breakpoints only in kernel context
        parisc: Allow to reboot machine after system halt
        parisc: Enable LOCKDEP support
        parisc: Add lightweight spinlock checks
        parisc: Use num_present_cpus() in alternative patching code
        parisc: Flush gatt writes and adjust gatt mask in parisc_agp_mask_memory()
        parisc: Improve cache flushing for PCXL in arch_sync_dma_for_cpu()
    • module: error out early on concurrent load of the same module file · 9828ed3f
      Linus Torvalds authored
      It turns out that udev under certain circumstances will concurrently try
      to load the same modules over-and-over excessively.  This isn't a kernel
      bug, but it ends up affecting the kernel, to the point that under
      certain circumstances we can fail to boot, because the kernel uses a lot
      of memory to read all the module data all at once.
      
      Note that it isn't a memory leak, it's just basically a thundering herd
      problem happening at bootup with a lot of CPUs, with the worst cases
      then being pretty bad.
      
      Admittedly the worst situations are somewhat contrived: lots and lots of
      CPUs, not a lot of memory, and KASAN enabled to make it all slower and
      as such (unintentionally) exacerbate the problem.
      
      Luis explains: [1]
      
       "My best assessment of the situation is that each CPU in udev ends up
        triggering a load of duplicate set of modules, not just one, but *a
        lot*. Not sure what heuristics udev uses to load a set of modules per
        CPU."
      
      Petr Pavlu chimes in: [2]
      
       "My understanding is that udev workers are forked. An initial kmod
        context is created by the main udevd process but no sharing happens
        after the fork. It means that the mentioned memory pool logic doesn't
        really kick in.
      
        Multiple parallel load requests come from multiple udev workers, for
        instance, each handling an udev event for one CPU device and making
        the exactly same requests as all others are doing at the same time.
      
        The optimization idea would be to recognize these duplicate requests
        at the udevd/kmod level and converge them"
      
      Note that module loading has tried to mitigate this issue before, see
      for example commit 064f4536 ("module: avoid allocation if module is
      already present and ready"), which has a few ASCII graphs on memory use
      due to this same issue.
      
      However, while that noticed that the module was already loaded, and
      exited with an error early before spending any more time on setting up
      the module, it didn't handle the case of multiple concurrent module
      loads all being active - but not complete - at the same time.
      
      Yes, one of them will eventually win the race and finalize its copy, and
      the others will then notice that the module already exists and error
      out, but while this all happens, we have tons of unnecessary concurrent
      work being done.
      
      Again, the real fix is for udev to not do that (maybe it should use
      threads instead of fork, and have actual shared data structures and not
      cause duplicate work). That real fix is apparently not trivial.
      
      But it turns out that the kernel already has a pretty good model for
      dealing with concurrent access to the same file: the i_writecount of the
      inode.
      
      In fact, the module loading already indirectly uses 'i_writecount',
      because 'kernel_file_read()' will in fact do
      
      	ret = deny_write_access(file);
      	if (ret)
      		return ret;
      	...
      	allow_write_access(file);
      
      around the read of the file data.  We do not allow concurrent writes to
      the file, and return -ETXTBSY if the file was open for writing at the
      same time as the module data is loaded from it.
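
      As a rough illustration of the counter convention described above, here
      is a minimal userspace model in C11 atomics. It is not the kernel's
      code: the struct model_inode type and the model_* function names are
      hypothetical, and only the i_writecount field name, the deny/allow
      pairing and the -ETXTBSY result are taken from the text. The model
      assumes that a positive count means active writers and a negative count
      means active write-deniers.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for an inode: >0 means writers, <0 means deniers. */
struct model_inode {
	atomic_int i_writecount;
};

/* A writer may join only while nobody is denying writes (count >= 0). */
static int model_get_write_access(struct model_inode *inode)
{
	int old = atomic_load(&inode->i_writecount);

	do {
		if (old < 0)
			return -ETXTBSY;
	} while (!atomic_compare_exchange_weak(&inode->i_writecount, &old, old + 1));
	return 0;
}

/* Drop one writer reference. */
static void model_put_write_access(struct model_inode *inode)
{
	atomic_fetch_sub(&inode->i_writecount, 1);
}

/* A denier may join only while there are no writers (count <= 0). */
static int model_deny_write_access(struct model_inode *inode)
{
	int old = atomic_load(&inode->i_writecount);

	do {
		if (old > 0)
			return -ETXTBSY;
	} while (!atomic_compare_exchange_weak(&inode->i_writecount, &old, old - 1));
	return 0;
}

/* Drop one denier reference. */
static void model_allow_write_access(struct model_inode *inode)
{
	atomic_fetch_add(&inode->i_writecount, 1);
}
```

      In this model several deniers can coexist (the count simply becomes
      more negative), which is why the plain "no concurrent writers" logic
      does nothing to stop several readers of the same file.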
      
      And the solution to the reader concurrency problem is to simply extend
      this "no concurrent writers" logic to simply be "exclusive access".
      
      Note that "exclusive" in this context isn't really some absolute thing:
      it's only exclusion from writers and from other "special readers" that
      do this writer denial.  So we simply introduce a variation of that
      "deny_write_access()" logic that not only denies write access, but also
      requires that this is the _only_ such access that denies write access.
      
      Which means that you can't start loading a module that is already being
      loaded as a module by somebody else, or you will get the same -ETXTBSY
      error that you would get if there were writers around.
      
      [ It also means that you can't try to load a currently executing
        executable as a module, for the same reason: executables do that same
        "deny_write_access()" thing, and that's obviously where the whole
        ETXTBSY logic traditionally came from.
      
        This is not a problem for kernel modules, since the set of normal
        executable files and kernel module files is entirely disjoint. ]
      
      This new function is called "exclusive_deny_write_access()", and the
      implementation is trivial, in that it's just an atomic decrement of
      i_writecount if it was 0 before.
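
      A minimal userspace sketch of that idea, again in C11 atomics and with
      hypothetical sketch_* names rather than the kernel's actual helpers:
      the exclusive variant is modeled as a compare-and-exchange from 0 to
      -1, so it fails whenever there is a writer or another denier.

```c
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for an inode: >0 means writers, <0 means deniers. */
struct sketch_inode {
	atomic_int i_writecount;
};

/* Succeeds only if the count is exactly 0: no writers and no other deniers. */
static int sketch_exclusive_deny_write_access(struct sketch_inode *inode)
{
	int expected = 0;

	if (atomic_compare_exchange_strong(&inode->i_writecount, &expected, -1))
		return 0;
	return -ETXTBSY;
}

/* Release the denial taken above, returning the count to 0. */
static void sketch_allow_write_access(struct sketch_inode *inode)
{
	atomic_fetch_add(&inode->i_writecount, 1);
}
```

      With this shape, a second caller that arrives while the first still
      holds the denial gets -ETXTBSY immediately, and can succeed again once
      the first caller has released it.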
      
      To use that new exclusivity check, all we then do is wrap the module
      loading with that exclusive_deny_write_access() / allow_write_access()
      pair.  The actual patch is a bit bigger than that, because we want to
      surround not just the "load file data" part, but the whole module setup,
      to get maximum exclusion.
      
      So this ends up splitting up "finit_module()" into a few helper
      functions to make it all very clear and legible.
      
      In Luis' test-case (bringing up 255 vcpu's in a virtual machine [3]),
      the "wasted vmalloc" space (ie module data read into a vmalloc'ed area
      in order to be loaded as a module, but then discarded because somebody
      else loaded the same module instead) dropped from 1.8GiB to 474kB.  Yes,
      that's gigabytes to kilobytes.
      
      It doesn't drop completely to zero, because even with this change, you
      can still end up having completely serial pointless module loads, where
      one udev process has loaded a module fully (and thus the kernel has
      released that exclusive lock on the module file), and then another udev
      process tries to load the same module again.
      
      So while we cannot fully get rid of the fundamental bug in user space,
      we _can_ get rid of the excessive concurrent thundering herd effect.
      
      A couple of final side notes on this all:
      
       - This tweak only affects the "finit_module()" system call, which gives
         the kernel a file descriptor with the module data.
      
         You can also just feed the module data as raw data from user space
         with "init_module()" (note the lack of 'f' at the beginning), and
         obviously for that case we do _not_ have any "exclusive read" logic.
      
         So if you absolutely want to do things wrong in user space, and try
         to load the same module multiple times, and error out only later when
         the kernel ends up saying "you can't load the same module name
         twice", you can still do that.
      
         And in fact, some distros will do exactly that, because they will
         uncompress the kernel module data in user space before feeding it to
         the kernel (mainly because they haven't started using the new kernel
         side decompression yet).
      
         So this is not some absolute "you can't do concurrent loads of the
         same module". It's literally just a very simple heuristic that will
         catch it early in case you try to load the exact same module file at
         the same time, and in that case avoid a potentially nasty situation.
      
       - There is another user of "deny_write_access()": the verity code that
         enables fs-verity on a file (the FS_IOC_ENABLE_VERITY ioctl).
      
         If you use fs-verity and you care about verifying the kernel modules
         (which does make sense), you should do it *before* loading said
         kernel module. That may sound obvious, but now the implementation
         basically requires it. Because if you try to do it concurrently, the
         kernel may refuse to load the module file that is being set up by the
         fs-verity code.
      
       - This all will obviously mean that if you insist on loading the same
         module in parallel, only one module load will succeed, and the others
         will return with an error.
      
         That was true before too, but what is different is that the -ETXTBSY
         error can be returned *before* the success case of another process
         fully loading and instantiating the module.
      
         Again, that might sound obvious, and it is indeed the whole point of
         the whole change: we are much quicker to notice the whole "you're
         already in the process of loading this module".
      
         So it's very much intentional, but it does mean that if you just
         spray the kernel with "finit_module()", and expect that the module is
         immediately loaded afterwards without checking the return value, you
         are doing something horribly horribly wrong.
      
         I'd like to say that that would never happen, but the whole _reason_
         for this commit is that udev is currently doing something horribly
         horribly wrong, so ...
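
      For the "check the return value" point above, a minimal user-space
      sketch might look like the following. It assumes a Linux system whose
      libc headers define SYS_finit_module; "try_finit_module()" and
      "describe_load_result()" are hypothetical helper names, and the
      strings are only one possible interpretation of each error.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to load the module file at `path`.
 * Returns 0 on success or a negative errno value on failure.
 */
static int try_finit_module(const char *path, const char *params)
{
	long ret;
	int err;
	int fd;

	assert(path != NULL && params != NULL);

	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -errno;

	ret = syscall(SYS_finit_module, fd, params, 0);
	err = (ret < 0) ? errno : 0;	/* save errno before close() can clobber it */
	close(fd);
	return -err;
}

/* Hypothetical policy: what a caller might conclude from each result. */
static const char *describe_load_result(int err)
{
	switch (err) {
	case 0:
		return "loaded";
	case -ETXTBSY:
		return "module file is busy, e.g. a concurrent load of the same file";
	case -EEXIST:
		return "a module with this name is already loaded";
	default:
		return "failed";
	}
}
```

      The point is only that the caller looks at the result: an -ETXTBSY
      return means this call did not load the module, and it says nothing
      about whether the other loader has finished or will succeed.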
      
      Link: https://lore.kernel.org/all/ZEGopJ8VAYnE7LQ2@bombadil.infradead.org/ [1]
      Link: https://lore.kernel.org/all/23bd0ce6-ef78-1cd8-1f21-0e706a00424a@suse.com/ [2]
      Link: https://lore.kernel.org/lkml/ZG%2Fa+nrt4%2FAAUi5z@bombadil.infradead.org/ [3]
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Lucas De Marchi <lucas.demarchi@intel.com>
      Cc: Petr Pavlu <petr.pavlu@suse.com>
      Tested-by: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9828ed3f
    • Merge tag 'vfs/v6.4-rc3/misc.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs · 9db89859
      Linus Torvalds authored
      Pull vfs fixes from Christian Brauner:
      
       - During the acl rework we merged this cycle, the generic_listxattr()
         helper had to be modified in a way that would, in principle, allow
         POSIX ACLs to be reported. At least that was the impression we
         had initially, because before the acl rework POSIX ACLs would be
         reported if the filesystem did have POSIX ACL xattr handlers in
         sb->s_xattr. That logic changed, and now we can simply check whether
         the superblock has SB_POSIXACL set and, if the inode has
         inode->i_{default_}acl set, report the appropriate POSIX ACL name.
      
         However, we didn't realize that generic_listxattr() was only ever
         used by two filesystems. Neither of them supports POSIX ACLs via
         sb->s_xattr handlers, and so they never reported POSIX ACLs via
         generic_listxattr() even if they raised SB_POSIXACL and did contain
         inodes which had acls set. The example here is nfs4.
      
         As a result, generic_listxattr() suddenly started reporting POSIX
         ACLs when it wouldn't have before. Since SB_POSIXACL implies that the
         umask isn't stripped in the VFS, nfs4 can't just drop SB_POSIXACL from
         the superblock, as that would also alter its umask handling.
      
         So just have generic_listxattr() not report POSIX ACLs as it never
         did anyway. It's documented as such.
      
       - Our SB_* flags currently use a signed integer and we shift into the
         last bit, causing UBSAN to complain about undefined behavior. Switch
         to using unsigned. While the original patch used an explicit unsigned
         bitshift, it's now pretty common to rely on the BIT() macro in a lot
         of headers. So the patch has been adjusted to use that.
      
       - Add Namjae as ntfs reviewer. They're already active this cycle so
         let's make it explicit right now.
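
      To illustrate the SB_* shift issue from the list above, here is a
      small user-space sketch. "BIT_MODEL()", "SB_NOUSER_MODEL" and
      "has_nouser()" are hypothetical stand-ins, not the kernel
      definitions; the only assumptions are that the flag in question is
      bit 31 and that int is 32 bits wide.

```c
#include <assert.h>

/* User-space stand-in for a BIT()-style helper: shift an unsigned 1. */
#define BIT_MODEL(nr)	(1UL << (nr))

/*
 * The problematic pattern, shown only as a comment because evaluating it
 * is undefined behavior where int is 32 bits (it shifts a signed 1 into
 * the sign bit):
 *
 *	#define SB_NOUSER_OLD	(1 << 31)
 */
#define SB_NOUSER_MODEL	BIT_MODEL(31)

/* The unsigned form has a well-defined value that can be checked at build time. */
static_assert(SB_NOUSER_MODEL == 0x80000000UL, "bit 31 as an unsigned value");

/* True if the flag bit is set in an unsigned flags word. */
static int has_nouser(unsigned long flags)
{
	return (flags & SB_NOUSER_MODEL) != 0;
}
```

      Because the constant is unsigned from the start, the shift never
      touches a sign bit and there is nothing for UBSAN to flag.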
      
      * tag 'vfs/v6.4-rc3/misc.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
        ntfs: Add myself as a reviewer
        fs: don't call posix_acl_listxattr in generic_listxattr
        fs: fix undefined behavior in bit shift for SB_NOUSER
      9db89859