  1. Apr 03, 2024
    • efi: fix panic in kdump kernel · 9114ba99
      Oleksandr Tymoshenko authored
      [ Upstream commit 62b71cd7 ]
      
      Check that get_next_variable() is actually a valid pointer before
      calling it. In the kdump kernel this method is set to NULL, which causes
      a panic during boot of the kexec-ed kernel.
      
      Tested with QEMU and OVMF firmware.
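
      The guard described above can be sketched in plain C. This is a
      user-space illustration only: struct fake_efi_ops and
      variable_services_supported() are hypothetical names, not the kernel's
      actual EFI structures or functions.

```c
#include <stddef.h>

/* Hypothetical stand-in for a table of firmware callbacks. */
struct fake_efi_ops {
    int (*get_next_variable)(void);
};

/* Stub callback that pretends the firmware call succeeded. */
static int stub_get_next_variable(void)
{
    return 0;
}

/* Returns 1 if the variable service looks usable, 0 otherwise. */
static int variable_services_supported(const struct fake_efi_ops *ops)
{
    /* Guard first: the callback may legitimately be NULL (as in kdump). */
    if (ops == NULL || ops->get_next_variable == NULL)
        return 0;

    /* Only now is it safe to call through the pointer. */
    return ops->get_next_variable() == 0;
}
```

      With a NULL callback the helper reports "unsupported" instead of
      dereferencing the pointer.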
      
      Fixes: bad267f9 ("efi: verify that variable services are supported")
      Signed-off-by: Oleksandr Tymoshenko <ovt@google.com>
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9114ba99
    • x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD · 1acbca93
      Adamos Ttofari authored
      [ Upstream commit 10e4b516 ]
      
      Commit 67236547 ("x86/fpu: Update XFD state where required") and
      commit 8bf26758 ("x86/fpu: Add XFD state to fpstate") introduced a
      per CPU variable xfd_state to keep the MSR_IA32_XFD value cached, in
      order to avoid unnecessary writes to the MSR.
      
      On CPU hotplug MSR_IA32_XFD is reset to the init_fpstate.xfd, which
      wipes out any stale state. But the per CPU cached xfd value is not
      reset, which brings them out of sync.
      
      As a consequence a subsequent xfd_update_state() might fail to update
      the MSR which in turn can result in XRSTOR raising a #NM in kernel
      space, which crashes the kernel.
      
      To fix this, introduce xfd_set_state() to write xfd_state together
      with MSR_IA32_XFD, and use it in all places that set MSR_IA32_XFD.
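
      A toy model of the failure and the fix, assuming a single CPU and using
      ordinary variables in place of the MSR and the per-CPU cache. All names
      here (fake_msr, cached_xfd, set_state(), ...) are hypothetical and only
      mirror the idea of writing the cache and the register together.

```c
#include <stdint.h>

static uint64_t fake_msr;    /* stands in for MSR_IA32_XFD */
static uint64_t cached_xfd;  /* stands in for the per-CPU xfd_state */

/* Write the register and its cache together so they cannot diverge. */
static void set_state(uint64_t xfd)
{
    fake_msr = xfd;
    cached_xfd = xfd;
}

/* Skip the (expensive) register write when the cache already matches. */
static void update_state(uint64_t xfd)
{
    if (cached_xfd != xfd)
        set_state(xfd);
}

/* Hotplug-style reset that forgets the cache: the buggy pattern. */
static void reset_register_only(uint64_t init)
{
    fake_msr = init;
}

/* Hotplug-style reset that goes through the common helper: the fix. */
static void reset_with_cache(uint64_t init)
{
    set_state(init);
}
```

      After reset_register_only(), a later update_state() with the old value
      is skipped because the stale cache still matches, which is the
      out-of-sync condition the commit describes.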
      
      Fixes: 67236547 ("x86/fpu: Update XFD state where required")
      Signed-off-by: Adamos Ttofari <attofari@amazon.de>
      Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20240322230439.456571-1-chang.seok.bae@intel.com
      Closes: https://lore.kernel.org/lkml/20230511152818.13839-1-attofari@amazon.de
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      1acbca93
    • x86/mpparse: Register APIC address only once · bebb5af0
      Thomas Gleixner authored
      [ Upstream commit f2208aa1 ]
      
      The APIC address is registered twice. First during the early detection and
      afterwards when actually scanning the table for APIC IDs. The APIC and
      topology core warn about the second attempt.
      
      Restrict it to the early detection call.
      
      Fixes: 81287ad6 ("x86/apic: Sanitize APIC address setup")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Link: https://lore.kernel.org/r/20240322185305.297774848@linutronix.de
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      bebb5af0
    • efi/libstub: fix efi_random_alloc() to allocate memory at alloc_min or higher address · 31a6a791
      KONDO KAZUMA(近藤 和真) authored
      [ Upstream commit 3cb4a482 ]
      
      The following warning is sometimes observed while booting my servers:
        [    3.594838] DMA: preallocated 4096 KiB GFP_KERNEL pool for atomic allocations
        [    3.602918] swapper/0: page allocation failure: order:10, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0-1
        ...
        [    3.851862] DMA: preallocated 1024 KiB GFP_KERNEL|GFP_DMA pool for atomic allocation
      
      If the 'nokaslr' boot option is set, the warning always happens.
      
      On x86, ZONE_DMA is a small zone covering the first 16MB of the physical
      address space. When this problem happens, most of that space seems to be
      used by the decompressed kernel. As a result, there is not enough space in
      ZONE_DMA to satisfy the DMA pool allocation request.
      
      Commit 2f77465b ("x86/efistub: Avoid placing the kernel below
      LOAD_PHYSICAL_ADDR") tried to fix this problem by introducing a lower
      bound on the allocation.
      
      But the fix is not complete.
      
      efi_random_alloc() allocates pages in the following steps:
      1. Count the total number of available slots ('total_slots')
      2. Randomly select a slot ('target_slot') to allocate
      3. Calculate a starting address ('target') that falls within target_slot
      4. Allocate pages whose starting address is 'target'
      
      In step 1, 'alloc_min' is used to offset the starting address of the
      memory chunk. But in step 3, 'alloc_min' is not considered at all. As a
      result, 'target' can be miscalculated and end up lower than 'alloc_min'.
      
      When KASLR is disabled, 'target_slot' is always 0 and the problem
      happens every time if the EFI memory map of the system meets the
      condition.
      
      Fix this problem by calculating 'target' considering 'alloc_min'.
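
      The difference in step 3 can be illustrated with a small calculation.
      This is a sketch under assumptions, not the libstub code: the helper
      names and the example addresses are made up, and a single memory region
      is considered.

```c
#include <stdint.h>

/* Round x up to the next multiple of align (align must be non-zero). */
static uint64_t round_up_u64(uint64_t x, uint64_t align)
{
    return (x + align - 1) / align * align;
}

/* Step 3 without the lower bound: only the region start is considered. */
static uint64_t target_without_min(uint64_t region_start, uint64_t align,
                                   uint64_t target_slot)
{
    return round_up_u64(region_start, align) + target_slot * align;
}

/* Step 3 with the lower bound: start from the higher of the two. */
static uint64_t target_with_min(uint64_t region_start, uint64_t alloc_min,
                                uint64_t align, uint64_t target_slot)
{
    uint64_t base = region_start > alloc_min ? region_start : alloc_min;

    return round_up_u64(base, align) + target_slot * align;
}
```

      For a region starting at 1MB, a 2MB alignment, an alloc_min of 16MB and
      target_slot 0, the first variant lands at 2MB (inside ZONE_DMA) while
      the second lands at 16MB.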
      
      Cc: linux-efi@vger.kernel.org
      Cc: Tom Englund <tomenglund26@gmail.com>
      Cc: linux-kernel@vger.kernel.org
      Fixes: 2f77465b ("x86/efistub: Avoid placing the kernel below LOAD_PHYSICAL_ADDR")
      Signed-off-by: Kazuma Kondo <kazuma-kondo@nec.com>
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      31a6a791
    • kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address · f13edd18
      Masami Hiramatsu (Google) authored
      [ Upstream commit 4e51653d ]
      
      Read from an unsafe address with copy_from_kernel_nofault() in
      arch_adjust_kprobe_addr() because this function is used before checking
      whether the address is in text or not. The syzkaller bot found a bug and
      reported that if the user specifies an inaccessible data area,
      arch_adjust_kprobe_addr() will cause a kernel panic.
      
      [ mingo: Clarified the comment. ]
      
      Fixes: cc66bb91 ("x86/ibt,kprobes: Cure sym+0 equals fentry woes")
      Reported-by: Qiang Zhang <zzqq0103.hey@gmail.com>
      Tested-by: Jinghao Jia <jinghao7@illinois.edu>
      Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lore.kernel.org/r/171042945004.154897.2221804961882915806.stgit@devnote2
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f13edd18
    • irqchip/renesas-rzg2l: Prevent spurious interrupts when setting trigger type · 455b94f9
      Biju Das authored
      [ Upstream commit 853a6030 ]
      
      RZ/G2L interrupt chips require that the interrupt is masked before changing
      the NMI, IRQ, TINT interrupt settings. Aside from that, after setting an edge
      trigger type it is required to clear the interrupt status register in order
      to avoid spurious interrupts.
      
      The current implementation does neither and is therefore prone to
      generating spurious interrupts when setting the trigger type.
      
      Address this by:
      
        - Ensuring that the interrupt is masked at the chip level across the
          update for the TINT chip
      
        - Clearing the interrupt status register after updating the trigger mode
          for edge type interrupts
      
      [ tglx: Massaged changelog and reverted the spin_lock_irqsave() change as
        	the set_type() callback is always called with interrupts disabled. ]
      
      Fixes: 3fed0955 ("irqchip: Add RZ/G2L IA55 Interrupt Controller driver")
      Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      455b94f9
    • irqchip/renesas-rzg2l: Rename rzg2l_irq_eoi() · e9b18e99
      Biju Das authored
      [ Upstream commit b4b5cd61 ]
      
      Rename rzg2l_irq_eoi() to rzg2l_clear_irq_int() and simplify the code by
      removing the redundant priv local variable.
      
      Suggested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Stable-dep-of: 853a6030 ("irqchip/renesas-rzg2l: Prevent spurious interrupts when setting trigger type")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e9b18e99
    • irqchip/renesas-rzg2l: Rename rzg2l_tint_eoi() · ddec478f
      Biju Das authored
      [ Upstream commit 7cb6362c ]
      
      Rename rzg2l_tint_eoi() to rzg2l_clear_tint_int() and simplify the code by
      removing the redundant priv and hw_irq local variables.
      
      Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Stable-dep-of: 853a6030 ("irqchip/renesas-rzg2l: Prevent spurious interrupts when setting trigger type")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ddec478f
    • irqchip/renesas-rzg2l: Add macro to retrieve TITSR register offset based on register's index · ec5482d2
      Claudiu Beznea authored
      [ Upstream commit 2eca4731 ]
      
      There are 2 TITSR registers available on the IA55 interrupt controller.
      
      Add a macro that retrieves the TITSR register offset based on its
      index. This macro is useful when adding suspend/resume support so both
      TITSR registers can be accessed in a for loop.
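
      An index-based offset macro of this kind might look like the sketch
      below. The base offset 0x24 and the 4-byte stride are assumptions made
      for illustration (check the hardware manual or the driver for the real
      values), and save_titsr() is a hypothetical helper operating on a plain
      array rather than on mapped I/O memory.

```c
#include <stdint.h>

#define TITSR_BASE   0x24u                      /* assumed offset of TITSR0 */
#define TITSR(n)     (TITSR_BASE + (n) * 4u)    /* byte offset of TITSRn */
#define TITSR_COUNT  2u                         /* two TITSR registers */

/* Copy both TITSR registers out of a register image, e.g. before suspend. */
static void save_titsr(const volatile uint32_t *regs,
                       uint32_t saved[TITSR_COUNT])
{
    unsigned int i;

    for (i = 0; i < TITSR_COUNT; i++)
        saved[i] = regs[TITSR(i) / 4u];  /* byte offset to word index */
}
```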
      
      Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20231120111820.87398-7-claudiu.beznea.uj@bp.renesas.com
      Stable-dep-of: 853a6030 ("irqchip/renesas-rzg2l: Prevent spurious interrupts when setting trigger type")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ec5482d2
    • irqchip/renesas-rzg2l: Flush posted write in irq_eoi() · 9913a078
      Biju Das authored
      [ Upstream commit 9eec61df ]
      
      The irq_eoi() callback of the RZ/G2L interrupt chip clears the relevant
      interrupt cause bit in the TSCR register by writing to it.
      
      This write is not sufficient because the write is posted and therefore not
      guaranteed to immediately clear the bit. Due to that delay the CPU can
      raise the just handled interrupt again.
      
      Prevent this by reading the register back which causes the posted write to
      be flushed to the hardware before the read completes.
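
      The write-then-read-back pattern can be modelled in user space with a
      toy register that buffers writes until the next read. Everything here
      (struct posted_reg, reg_write(), reg_read(), clear_cause_bit()) is a
      hypothetical model of posted-write behaviour, not the driver's code; in
      a driver the same idea is usually a writel() followed by a readl() of
      the same register.

```c
#include <stdint.h>

/* Toy register: writes are buffered until the next read of the register. */
struct posted_reg {
    uint32_t hw;          /* value the "hardware" currently sees */
    uint32_t pending;     /* buffered (posted) write, if any */
    int has_pending;
};

static void reg_write(struct posted_reg *r, uint32_t v)
{
    r->pending = v;       /* posted: not yet visible to the hardware */
    r->has_pending = 1;
}

static uint32_t reg_read(struct posted_reg *r)
{
    if (r->has_pending) { /* a read forces earlier writes to complete */
        r->hw = r->pending;
        r->has_pending = 0;
    }
    return r->hw;
}

/* Clear one interrupt cause bit and make sure the clear has landed. */
static void clear_cause_bit(struct posted_reg *r, unsigned int bit)
{
    uint32_t v = reg_read(r) & ~(UINT32_C(1) << bit);

    reg_write(r, v);
    (void)reg_read(r);    /* read back to flush the posted write */
}
```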
      
      Fixes: 3fed0955 ("irqchip: Add RZ/G2L IA55 Interrupt Controller driver")
      Signed-off-by: Biju Das <biju.das.jz@bp.renesas.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9913a078
    • irqchip/renesas-rzg2l: Implement restriction when writing ISCR register · c15a37e3
      Claudiu Beznea authored
      [ Upstream commit ef88eefb ]
      
      The RZ/G2L manual (chapter "IRQ Status Control Register (ISCR)") describes
      the operation to clear interrupts through the ISCR register as follows:
      
      [Write operation]
      
        When "Falling-edge detection", "Rising-edge detection" or
        "Falling/Rising-edge detection" is set in IITSR:
      
          - In case ISTAT is 1
      	0: IRQn interrupt detection status is cleared.
      	1: Invalid to write.
          - In case ISTAT is 0
      	Invalid to write.
      
        When "Low-level detection" is set in IITSR.:
              Invalid to write.
      
      Take the interrupt type into account when clearing interrupts through the
      ISCR register to avoid writing the ISCR when the interrupt type is level.
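
      The rule quoted from the manual can be expressed as a small decision
      function. This is a sketch on a plain variable: enum irq_sense and
      clear_irq_status() are hypothetical names, and the register is modelled
      as an ordinary uint32_t rather than mapped I/O.

```c
#include <stdint.h>

/* Hypothetical encoding of the detection type configured in IITSR. */
enum irq_sense {
    SENSE_LOW_LEVEL,
    SENSE_FALLING_EDGE,
    SENSE_RISING_EDGE,
    SENSE_BOTH_EDGES
};

/*
 * Clear the status bit for IRQ n only when the manual allows a write:
 * an edge type is configured and ISTAT is currently 1.
 * Returns 1 if the register was written, 0 if the write was skipped.
 */
static int clear_irq_status(uint32_t *iscr, unsigned int n,
                            enum irq_sense sense)
{
    uint32_t bit = UINT32_C(1) << n;

    if (sense == SENSE_LOW_LEVEL)
        return 0;         /* level type: invalid to write */
    if (!(*iscr & bit))
        return 0;         /* ISTAT is 0: invalid to write */

    *iscr &= ~bit;        /* write 0 to clear the detection status */
    return 1;
}
```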
      
      Signed-off-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Link: https://lore.kernel.org/r/20231120111820.87398-6-claudiu.beznea.uj@bp.renesas.com
      Stable-dep-of: 9eec61df ("irqchip/renesas-rzg2l: Flush posted write in irq_eoi()")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c15a37e3
    • printk: Update @console_may_schedule in console_trylock_spinning() · ea4c338c
      John Ogness authored
      [ Upstream commit 80769724 ]
      
      console_trylock_spinning() may take over the console lock from a
      schedulable context. Update @console_may_schedule to make sure it
      reflects a trylock acquire.
      
      Reported-by: Mukesh Ojha <quic_mojha@quicinc.com>
      Closes: https://lore.kernel.org/lkml/20240222090538.23017-1-quic_mojha@quicinc.com
      Fixes: dbdda842 ("printk: Add console owner and waiter logic to load balance console writes")
      Signed-off-by: John Ogness <john.ogness@linutronix.de>
      Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
      Reviewed-by: Petr Mladek <pmladek@suse.com>
      Link: https://lore.kernel.org/r/875xybmo2z.fsf@jogness.linutronix.de
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ea4c338c
    • iommu/dma: Force swiotlb_max_mapping_size on an untrusted device · e07a16e6
      Nicolin Chen authored
      [ Upstream commit afc5aa46 ]
      
      The swiotlb does not support a mapping size > swiotlb_max_mapping_size().
      On the other hand, with a 64KB PAGE_SIZE configuration, it has been observed
      that an NVME device can map sizes between 300KB and 512KB, which causes the
      swiotlb mappings to fail even though the default swiotlb pool has many slots:
          systemd[1]: Started Journal Service.
       => nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots)
          note: journal-offline[392] exited with irqs disabled
          note: journal-offline[392] exited with preempt_count 1
      
      Call trace:
      [    3.099918]  swiotlb_tbl_map_single+0x214/0x240
      [    3.099921]  iommu_dma_map_page+0x218/0x328
      [    3.099928]  dma_map_page_attrs+0x2e8/0x3a0
      [    3.101985]  nvme_prep_rq.part.0+0x408/0x878 [nvme]
      [    3.102308]  nvme_queue_rqs+0xc0/0x300 [nvme]
      [    3.102313]  blk_mq_flush_plug_list.part.0+0x57c/0x600
      [    3.102321]  blk_add_rq_to_plug+0x180/0x2a0
      [    3.102323]  blk_mq_submit_bio+0x4c8/0x6b8
      [    3.103463]  __submit_bio+0x44/0x220
      [    3.103468]  submit_bio_noacct_nocheck+0x2b8/0x360
      [    3.103470]  submit_bio_noacct+0x180/0x6c8
      [    3.103471]  submit_bio+0x34/0x130
      [    3.103473]  ext4_bio_write_folio+0x5a4/0x8c8
      [    3.104766]  mpage_submit_folio+0xa0/0x100
      [    3.104769]  mpage_map_and_submit_buffers+0x1a4/0x400
      [    3.104771]  ext4_do_writepages+0x6a0/0xd78
      [    3.105615]  ext4_writepages+0x80/0x118
      [    3.105616]  do_writepages+0x90/0x1e8
      [    3.105619]  filemap_fdatawrite_wbc+0x94/0xe0
      [    3.105622]  __filemap_fdatawrite_range+0x68/0xb8
      [    3.106656]  file_write_and_wait_range+0x84/0x120
      [    3.106658]  ext4_sync_file+0x7c/0x4c0
      [    3.106660]  vfs_fsync_range+0x3c/0xa8
      [    3.106663]  do_fsync+0x44/0xc0
      
      Since untrusted devices might go down the swiotlb pathway with dma-iommu,
      these devices should not map a size larger than swiotlb_max_mapping_size.
      
      To fix this bug, add iommu_dma_max_mapping_size() for untrusted devices to
      take into account swiotlb_max_mapping_size() vs. iova_rcache_range() from
      the iommu_dma_opt_mapping_size().
      
      Fixes: 82612d66 ("iommu: Allow the dma-iommu api to use bounce buffers")
      Link: https://lore.kernel.org/r/ee51a3a5c32cf885b18f6416171802669f4a718a.1707851466.git.nicolinc@nvidia.com
      Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
      [will: Drop redundant is_swiotlb_active(dev) check]
      Signed-off-by: Will Deacon <will@kernel.org>
      Reviewed-by: Michael Kelley <mhklinux@outlook.com>
      Acked-by: Robin Murphy <robin.murphy@arm.com>
      Tested-by: Nicolin Chen <nicolinc@nvidia.com>
      Tested-by: Michael Kelley <mhklinux@outlook.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      e07a16e6
    • swiotlb: Fix alignment checks when both allocation and DMA masks are present · c803069d
      Will Deacon authored
      [ Upstream commit 51b30ecb ]
      
      Nicolin reports that swiotlb buffer allocations fail for an NVME device
      behind an IOMMU using 64KiB pages. This is because we end up with a
      minimum allocation alignment of 64KiB (for the IOMMU to map the buffer
      safely) but a minimum DMA alignment mask corresponding to a 4KiB NVME
      page (i.e. preserving the 4KiB page offset from the original allocation).
      If the original address is not 4KiB-aligned, the allocation will fail
      because swiotlb_search_pool_area() erroneously compares these unmasked
      bits with the 64KiB-aligned candidate allocation.
      
      Tweak swiotlb_search_pool_area() so that the DMA alignment mask is
      reduced based on the required alignment of the allocation.
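
      One plausible reading of "reduced based on the required alignment" is
      that bits already guaranteed by the allocation alignment are dropped
      from the DMA alignment mask before comparing. The helper below is a
      hypothetical sketch of that mask arithmetic only; it is not the body of
      swiotlb_search_pool_area().

```c
/*
 * Drop from the DMA alignment mask every bit that the allocation
 * alignment already covers, so those bits are no longer compared
 * against the candidate slot address.
 */
static unsigned long reduce_dma_align_mask(unsigned long dma_align_mask,
                                           unsigned long alloc_align_mask)
{
    return dma_align_mask & ~alloc_align_mask;
}
```

      With a 4KiB DMA mask (0xfff) and a 64KiB allocation mask (0xffff), the
      reduced mask is 0, so the 4KiB page offset of the original address no
      longer rules out a 64KiB-aligned candidate.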
      
      Fixes: 82612d66 ("iommu: Allow the dma-iommu api to use bounce buffers")
      Link: https://lore.kernel.org/r/cover.1707851466.git.nicolinc@nvidia.com
      Reported-by: Nicolin Chen <nicolinc@nvidia.com>
      Signed-off-by: Will Deacon <will@kernel.org>
      Reviewed-by: Michael Kelley <mhklinux@outlook.com>
      Tested-by: Nicolin Chen <nicolinc@nvidia.com>
      Tested-by: Michael Kelley <mhklinux@outlook.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      c803069d
    • swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc() · ae2f8dbe
      Will Deacon authored
      [ Upstream commit cbf53074 ]
      
      core-api/dma-api-howto.rst states the following properties of
      dma_alloc_coherent():
      
        | The CPU virtual address and the DMA address are both guaranteed to
        | be aligned to the smallest PAGE_SIZE order which is greater than or
        | equal to the requested size.
      
      However, swiotlb_alloc() passes zero for the 'alloc_align_mask'
      parameter of swiotlb_find_slots() and so this property is not upheld.
      Instead, allocations larger than a page are aligned to PAGE_SIZE.
      
      Calculate the mask corresponding to the page order suitable for holding
      the allocation and pass that to swiotlb_find_slots().
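
      The mask calculation can be sketched as follows, assuming 4KiB pages.
      sketch_get_order() is a simplified user-space stand-in for the kernel's
      get_order(), and both helper names are hypothetical; sizes are assumed
      to be small enough that the shift does not overflow.

```c
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12u   /* assumes 4KiB pages */

/* Smallest order such that (PAGE_SIZE << order) >= size. */
static unsigned int sketch_get_order(size_t size)
{
    unsigned int order = 0;
    size_t span = (size_t)1 << SKETCH_PAGE_SHIFT;

    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}

/* Alignment mask matching the page order that holds the allocation. */
static size_t alloc_align_mask_for(size_t size)
{
    return ((size_t)1 << (sketch_get_order(size) + SKETCH_PAGE_SHIFT)) - 1;
}
```

      An 8KiB request yields a mask of 0x1fff (8KiB alignment) and a 12KiB
      request yields 0x3fff (16KiB alignment), rather than the PAGE_SIZE
      alignment that a zero mask would give.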
      
      Fixes: e81e99ba ("swiotlb: Support aligned swiotlb buffers")
      Signed-off-by: Will Deacon <will@kernel.org>
      Reviewed-by: Michael Kelley <mhklinux@outlook.com>
      Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
      Tested-by: Nicolin Chen <nicolinc@nvidia.com>
      Tested-by: Michael Kelley <mhklinux@outlook.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      ae2f8dbe
    • swiotlb: Fix double-allocation of slots due to broken alignment handling · 3e7acd6e
      Will Deacon authored
      [ Upstream commit 04867a7a ]
      
      Commit bbb73a10 ("swiotlb: fix a braino in the alignment check fix"),
      which was a fix for commit 0eee5ae1 ("swiotlb: fix slot alignment
      checks"), causes a functional regression with vsock in a virtual machine
      using bouncing via a restricted DMA SWIOTLB pool.
      
      When virtio allocates the virtqueues for the vsock device using
      dma_alloc_coherent(), the SWIOTLB search can return page-unaligned
      allocations if 'area->index' was left unaligned by a previous allocation
      from the buffer:
      
       # Final address in brackets is the SWIOTLB address returned to the caller
       | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1645-1649/7168 (0x98326800)
       | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1649-1653/7168 (0x98328800)
       | virtio-pci 0000:00:07.0: orig_addr 0x0 alloc_size 0x2000, iotlb_align_mask 0x800 stride 0x2: got slot 1653-1657/7168 (0x9832a800)
      
      This ends badly (typically buffer corruption and/or a hang) because
      swiotlb_alloc() is expecting a page-aligned allocation and so blindly
      returns a pointer to the 'struct page' corresponding to the allocation,
      therefore double-allocating the first half (2KiB slot) of the 4KiB page.
      
      Fix the problem by treating the allocation alignment separately to any
      additional alignment requirements from the device, using the maximum
      of the two as the stride to search the buffer slots and taking care
      to ensure a minimum of page-alignment for buffers larger than a page.
      
      This also resolves swiotlb allocation failures occurring due to the
      inclusion of ~PAGE_MASK in 'iotlb_align_mask' for large allocations and
      resulting in alignment requirements exceeding swiotlb_max_mapping_size().
      
      Fixes: bbb73a10 ("swiotlb: fix a braino in the alignment check fix")
      Fixes: 0eee5ae1 ("swiotlb: fix slot alignment checks")
      Signed-off-by: Will Deacon <will@kernel.org>
      Reviewed-by: Michael Kelley <mhklinux@outlook.com>
      Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
      Tested-by: Nicolin Chen <nicolinc@nvidia.com>
      Tested-by: Michael Kelley <mhklinux@outlook.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      3e7acd6e
    • entry: Respect changes to system call number by trace_sys_enter() · 4da46308
      André Rösti authored
      [ Upstream commit fb13b11d ]
      
      When a probe is registered at the trace_sys_enter() tracepoint, and that
      probe changes the system call number, the old system call still gets
      executed.  This worked correctly until commit b6ec4134 ("core/entry:
      Report syscall correctly for trace and audit"), which removed the
      re-evaluation of the syscall number after the trace point.
      
      Restore the original semantics by re-evaluating the system call number
      after trace_sys_enter().
      
      The performance impact of this re-evaluation is minimal because it only
      takes place when a trace point is active, and compared to the actual trace
      point overhead the read from a cache hot variable is negligible.
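
      The behavioural difference can be shown with a toy model in which a
      probe is a callback that may rewrite the syscall number stored in the
      register state. struct fake_regs and the two enter functions are
      hypothetical and only mirror the shape of the logic described above.

```c
#include <stddef.h>

/* Toy register state holding the syscall number. */
struct fake_regs {
    long nr;
};

typedef void (*probe_fn)(struct fake_regs *regs);

/* Without re-evaluation: the probe's change to regs->nr is ignored. */
static long enter_without_reeval(struct fake_regs *regs, probe_fn probe)
{
    long syscall = regs->nr;

    if (probe != NULL)
        probe(regs);
    return syscall;
}

/* With re-evaluation: re-read the number only when a probe actually ran. */
static long enter_with_reeval(struct fake_regs *regs, probe_fn probe)
{
    long syscall = regs->nr;

    if (probe != NULL) {
        probe(regs);
        syscall = regs->nr;  /* may have been changed by the probe */
    }
    return syscall;
}

/* Example probe that redirects every call to syscall number 39. */
static void redirect_probe(struct fake_regs *regs)
{
    regs->nr = 39;
}
```

      The re-read sits inside the "probe active" branch, which matches the
      argument that the extra cost is only paid when tracing is enabled.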
      
      Fixes: b6ec4134 ("core/entry: Report syscall correctly for trace and audit")
      Signed-off-by: André Rösti <an.roesti@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lore.kernel.org/r/20240311211704.7262-1-an.roesti@gmail.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4da46308
    • ARM: 9359/1: flush: check if the folio is reserved for no-mapping addresses · 0c027c2b
      Yongqiang Liu authored
      [ Upstream commit 0c66c6f4 ]
      
      Commit a4d5613c ("arm: extend pfn_valid to take into account
      freed memory map alignment") changed the semantics of pfn_valid() to check
      for the presence of the memory map for a PFN. As a result, pfn_valid() can
      report a valid page for an address which is reserved but not mapped by the
      kernel[1], and the system crashed during a uio test with the following
      memory layout:
      
       node   0: [mem 0x00000000c0a00000-0x00000000cc8fffff]
       node   0: [mem 0x00000000d0000000-0x00000000da1fffff]
       the uio layout is:0xc0900000, 0x100000
      
      The crash backtrace looks like:
      
        Unable to handle kernel paging request at virtual address bff00000
        [...]
        CPU: 1 PID: 465 Comm: startapp.bin Tainted: G           O      5.10.0 #1
        Hardware name: Generic DT based system
        PC is at b15_flush_kern_dcache_area+0x24/0x3c
        LR is at __sync_icache_dcache+0x6c/0x98
        [...]
         (b15_flush_kern_dcache_area) from (__sync_icache_dcache+0x6c/0x98)
         (__sync_icache_dcache) from (set_pte_at+0x28/0x54)
         (set_pte_at) from (remap_pfn_range+0x1a0/0x274)
         (remap_pfn_range) from (uio_mmap+0x184/0x1b8 [uio])
         (uio_mmap [uio]) from (__mmap_region+0x264/0x5f4)
         (__mmap_region) from (__do_mmap_mm+0x3ec/0x440)
         (__do_mmap_mm) from (do_mmap+0x50/0x58)
         (do_mmap) from (vm_mmap_pgoff+0xfc/0x188)
         (vm_mmap_pgoff) from (ksys_mmap_pgoff+0xac/0xc4)
         (ksys_mmap_pgoff) from (ret_fast_syscall+0x0/0x5c)
        Code: e0801001 e2423001 e1c00003 f57ff04f (ee070f3e)
        ---[ end trace 09cf0734c3805d52 ]---
        Kernel panic - not syncing: Fatal exception
      
      So check if PG_reserved was set to solve this issue.
      
      [1]: https://lore.kernel.org/lkml/Zbtdue57RO0QScJM@linux.ibm.com/
      
      Fixes: a4d5613c ("arm: extend pfn_valid to take into account freed memory map alignment")
      Suggested-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      0c027c2b
    • ARM: 9352/1: iwmmxt: Remove support for PJ4/PJ4B cores · 66689127
      Ard Biesheuvel authored
      [ Upstream commit b9920fdd ]
      
      PJ4 is a v7 core that incorporates an iWMMXt coprocessor. However, GCC
      does not support this combination (its iWMMXt configuration always
      implies v5te), and so there is no v6/v7 user space that actually makes
      use of this, beyond generic support for things like setjmp() that
      preserve/restore the iWMMXt register file using generic LDC/STC
      instructions emitted in assembler.  As [0] appears to imply, this logic
      is triggered for the init process at boot, and so most user threads will
      have an iWMMXt register context associated with them, even though it is
      never used.
      
      At this point, it is highly unlikely that such GCC support will ever
      materialize (and Clang does not implement support for iWMMXt to begin
      with).
      
      This means that advertising iWMMXt support on these cores results in
      context switch overhead without any associated benefit, and so it is
      better to simply ignore the iWMMXt unit on these systems. So rip out the
      support. Doing so also fixes the issue reported in [0] related to UNDEF
      handling of co-processor #0/#1 instructions issued from user space
      running in Thumb2 mode.
      
      The PJ4 cores are used in four platforms: Armada 370/xp, Dove (Cubox,
      d2plug), MMP2 (xo-1.75) and Berlin (Google TV). Out of these, only the
      first is still widely used, but that one actually doesn't have iWMMXt
      but instead has only VFPV3-D16, and so it is not impacted by this
      change.
      
      Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218427 [0]
      
      Fixes: 8bcba70c ("ARM: entry: Disregard Thumb undef exception ...")
      Acked-by: Linus Walleij <linus.walleij@linaro.org>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Nicolas Pitre <nico@fluxnic.net>
      Reviewed-by: Jisheng Zhang <jszhang@kernel.org>
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      66689127
    • clocksource/drivers/arm_global_timer: Fix maximum prescaler value · df13f436
      Martin Blumenstingl authored
      [ Upstream commit b34b9547 ]
      
      The prescaler in the "Global Timer Control Register bit assignments" is
      documented to use bits [15:8], which means that the maximum prescaler
      register value is 0xff.
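
      The arithmetic behind that limit: bits [15:8] form an 8-bit field, so
      the largest value it can hold is 0xff. The sketch below uses
      hypothetical macro and function names and a plain variable in place of
      the control register.

```c
#include <stdint.h>

#define PSV_SHIFT  8u                               /* field starts at bit 8 */
#define PSV_MAX    0xffu                            /* 8-bit field: bits [15:8] */
#define PSV_MASK   ((uint32_t)PSV_MAX << PSV_SHIFT) /* 0x0000ff00 */

/*
 * Program the prescaler field of a control word.
 * Returns 0 on success, -1 if the value does not fit in bits [15:8].
 */
static int set_prescaler(uint32_t *ctrl, uint32_t psv)
{
    if (psv > PSV_MAX)
        return -1;

    *ctrl = (*ctrl & ~PSV_MASK) | (psv << PSV_SHIFT);
    return 0;
}
```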
      
      Fixes: 171b45a4 ("clocksource/drivers/arm_global_timer: Implement rate compensation whenever source clock changes")
      Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
      Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Link: https://lore.kernel.org/r/20240218174138.1942418-2-martin.blumenstingl@googlemail.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      df13f436
    • x86/sev: Fix position dependent variable references in startup code · 0982fd6b
      Ard Biesheuvel authored
      commit 1c811d40 upstream.
      
      The early startup code executes from a 1:1 mapping of memory, which
      differs from the mapping that the code was linked and/or relocated to
      run at. The latter mapping is not active yet at this point, and so
      symbol references that rely on it will fault.
      
      Given that the core kernel is built without -fPIC, symbol references are
      typically emitted as absolute, and so any such references occurring in
      the early startup code will therefore crash the kernel.
      
      While an attempt was made to work around this for the early SEV/SME
      startup code, by forcing RIP-relative addressing for certain global
      SEV/SME variables via inline assembly (see snp_cpuid_get_table() for
      example), RIP-relative addressing must be pervasively enforced for
      SEV/SME global variables when accessed prior to page table fixups.
      
      __startup_64() already handles this issue for select non-SEV/SME global
      variables using fixup_pointer(), which adjusts the pointer relative to a
      `physaddr` argument. To avoid having to pass around this `physaddr`
      argument across all functions needing to apply pointer fixups, introduce
      a macro RIP_RELATIVE_REF() which generates a RIP-relative reference to
      a given global variable. It is used where necessary to force
      RIP-relative accesses to global variables.
      
      For backporting purposes, this patch makes no attempt at cleaning up
      other occurrences of this pattern, involving either inline asm or
      fixup_pointer(). Those will be addressed later.
      
        [ bp: Call it "rip_rel_ref" everywhere like other code shortens
          "rIP-relative reference" and make the asm wrapper __always_inline. ]
      
      Co-developed-by: Kevin Loughlin <kevinloughlin@google.com>
      Signed-off-by: Kevin Loughlin <kevinloughlin@google.com>
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: <stable@kernel.org>
      Link: https://lore.kernel.org/all/20240130220845.1978329-1-kevinloughlin@google.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      0982fd6b
    • Borislav Petkov (AMD)'s avatar
      x86/Kconfig: Remove CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT · ecd16da3
      Borislav Petkov (AMD) authored
      commit 29956748 upstream.
      
      It was meant well at the time but nothing's using it so get rid of it.
      
      Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Link: https://lore.kernel.org/r/20240202163510.GDZb0Zvj8qOndvFOiZ@fat_crate.local
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ecd16da3
    • Alex Williamson's avatar
      vfio/fsl-mc: Block calling interrupt handler without trigger · ee0bd4ad
      Alex Williamson authored
      commit 7447d911 upstream.
      
      The eventfd_ctx trigger pointer of the vfio_fsl_mc_irq object is
      initially NULL and may become NULL if the user sets the trigger
      eventfd to -1.  The interrupt handler itself is guaranteed that
      trigger is always valid between request_irq() and free_irq(), but
      the loopback testing mechanisms to invoke the handler function
      need to test the trigger.  The triggering and setting ioctl paths
      both make use of igate and are therefore mutually exclusive.
      
      The vfio-fsl-mc driver does not make use of irqfds, nor does it
      support any sort of masking operations, therefore unlike vfio-pci
      and vfio-platform, the flow can remain essentially unchanged.
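
      The check that the loopback path needs can be sketched as follows. This
      is a simplified userspace model under stated assumptions: irq_sketch,
      count_signal_sketch() and loopback_trigger_sketch() are hypothetical
      names, a function pointer stands in for the eventfd_ctx, and the caller
      is assumed to already hold the igate mutex.

```c
#include <stddef.h>

/* Hypothetical stand-in for the driver's per-IRQ state. */
struct irq_sketch {
    void (*trigger)(void *ctx);  /* models the eventfd trigger; may be NULL */
    void *ctx;
};

/* Example signal target: bumps the int counter that ctx points to. */
static void count_signal_sketch(void *ctx)
{
    ++*(int *)ctx;
}

/*
 * Loopback path. The caller is assumed to hold the lock that also
 * serializes trigger changes, so the pointer cannot change under us.
 * Returns 1 if a signal was delivered, 0 if no trigger is configured.
 */
static int loopback_trigger_sketch(struct irq_sketch *irq)
{
    if (!irq->trigger)
        return 0;

    irq->trigger(irq->ctx);
    return 1;
}
```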
      
      Cc: Diana Craciun <diana.craciun@oss.nxp.com>
      Cc: <stable@vger.kernel.org>
      Fixes: cc0ee20b ("vfio/fsl-mc: trigger an interrupt via eventfd")
      Reviewed-by: Kevin Tian <kevin.tian@intel.com>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Link: https://lore.kernel.org/r/20240308230557.805580-8-alex.williamson@redhat.com
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ee0bd4ad
    • Alex Williamson's avatar
      vfio/platform: Create persistent IRQ handlers · 62d4e43a
      Alex Williamson authored
      commit 675daf43 upstream.
      
      The vfio-platform SET_IRQS ioctl currently allows loopback triggering of
      an interrupt before a signaling eventfd has been configured by the user,
      which thereby allows a NULL pointer dereference.
      
      Rather than register the IRQ relative to a valid trigger, register all
      IRQs in a disabled state in the device open path.  This allows mask
      operations on the IRQ to nest within the overall enable state governed
      by a valid eventfd signal.  This decouples @masked, protected by the
      @locked spinlock, from @trigger, protected via the @igate mutex.
      
      In doing so, it's guaranteed that changes to @trigger cannot race the
      IRQ handlers because the IRQ handler is synchronously disabled before
      modifying the trigger, and loopback triggering of the IRQ via ioctl is
      safe due to serialization with trigger changes via igate.
      
      For compatibility, request_irq() failures are maintained to be local to
      the SET_IRQS ioctl rather than a fatal error in the open device path.
      This allows, for example, a userspace driver with polling mode support
      to continue to work regardless of moving the request_irq() call site.
      This necessarily blocks all SET_IRQS access to the failed index.
      
      Cc: Eric Auger <eric.auger@redhat.com>
      Cc: <stable@vger.kernel.org>
      Fixes: 57f972e2 ("vfio/platform: trigger an interrupt via eventfd")
      Reviewed-by: Kevin Tian <kevin.tian@intel.com>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Link: https://lore.kernel.org/r/20240308230557.805580-7-alex.williamson@redhat.com
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      62d4e43a
    • Alex Williamson's avatar
      vfio/pci: Create persistent INTx handler · 69276a55
      Alex Williamson authored
      commit 18c198c9 upstream.
      
      A vulnerability exists where the eventfd for INTx signaling can be
      deconfigured, which unregisters the IRQ handler but still allows
      eventfds to be signaled with a NULL context through the SET_IRQS ioctl
      or through unmask irqfd if the device interrupt is pending.
      
      Ideally this could be solved with some additional locking; the igate
      mutex serializes the ioctl and config space accesses, and the interrupt
      handler is unregistered relative to the trigger, but the irqfd path
      runs asynchronous to those.  The igate mutex cannot be acquired from the
      atomic context of the eventfd wake function.  Disabling the irqfd
      relative to the eventfd registration is potentially incompatible with
      existing userspace.
      
      As a result, the solution implemented here moves configuration of the
      INTx interrupt handler to track the lifetime of the INTx context object
      and irq_type configuration, rather than registration of a particular
      trigger eventfd.  Synchronization is added between the ioctl path and
      eventfd_signal() wrapper such that the eventfd trigger can be
      dynamically updated relative to in-flight interrupts or irqfd callbacks.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 89e1f7d4 ("vfio: Add PCI device driver")
      Reported-by: Reinette Chatre <reinette.chatre@intel.com>
      Reviewed-by: Kevin Tian <kevin.tian@intel.com>
      Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Link: https://lore.kernel.org/r/20240308230557.805580-5-alex.williamson@redhat.com
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      69276a55
    • Alex Williamson's avatar
      vfio: Introduce interface to flush virqfd inject workqueue · 2ee432d7
      Alex Williamson authored
      commit b620ecbd upstream.
      
      In order to synchronize changes that can affect the thread callback,
      introduce an interface to force a flush of the inject workqueue.  The
      irqfd pointer is only valid under spinlock, but the workqueue cannot
      be flushed under spinlock.  Therefore the flush work for the irqfd is
      queued under spinlock.  The vfio_irqfd_cleanup_wq workqueue is re-used
      for queuing this work such that flushing the workqueue is also ordered
      relative to shutdown.
      
      Reviewed-by: Kevin Tian <kevin.tian@intel.com>
      Reviewed-by: Reinette Chatre <reinette.chatre@intel.com>
      Reviewed-by: Eric Auger <eric.auger@redhat.com>
      Link: https://lore.kernel.org/r/20240308230557.805580-4-alex.williamson@redhat.com
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      Stable-dep-of: 18c198c9 ("vfio/pci: Create persistent INTx handler")
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2ee432d7
    • Josef Bacik's avatar
      btrfs: fix deadlock with fiemap and extent locking · ded566b4
      Josef Bacik authored
      commit b0ad381f upstream.
      
      While working on the patchset to remove extent locking I got a lockdep
      splat with fiemap and pagefaulting with my new extent lock replacement
      lock.
      
      This deadlock exists with our normal code; we just don't have lockdep
      annotations with the extent locking, so we've never noticed it.
      
      Since we're copying the fiemap extent to user space on every iteration
      we have the chance of pagefaulting.  Because we hold the extent lock for
      the entire range we could mkwrite into a range in the file that we have
      mmap'ed.  This would deadlock with the following stack trace:
      
      [<0>] lock_extent+0x28d/0x2f0
      [<0>] btrfs_page_mkwrite+0x273/0x8a0
      [<0>] do_page_mkwrite+0x50/0xb0
      [<0>] do_fault+0xc1/0x7b0
      [<0>] __handle_mm_fault+0x2fa/0x460
      [<0>] handle_mm_fault+0xa4/0x330
      [<0>] do_user_addr_fault+0x1f4/0x800
      [<0>] exc_page_fault+0x7c/0x1e0
      [<0>] asm_exc_page_fault+0x26/0x30
      [<0>] rep_movs_alternative+0x33/0x70
      [<0>] _copy_to_user+0x49/0x70
      [<0>] fiemap_fill_next_extent+0xc8/0x120
      [<0>] emit_fiemap_extent+0x4d/0xa0
      [<0>] extent_fiemap+0x7f8/0xad0
      [<0>] btrfs_fiemap+0x49/0x80
      [<0>] __x64_sys_ioctl+0x3e1/0xb50
      [<0>] do_syscall_64+0x94/0x1a0
      [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
      
      I wrote an fstest to reproduce this deadlock without my replacement lock
      and verified that the deadlock exists with our existing locking.
      
      To fix this simply don't take the extent lock for the entire duration of
      the fiemap.  This is safe in general because we keep track of where we
      are when we're searching the tree, so if an ordered extent updates in
      the middle of our fiemap call we'll still emit the correct extents
      because we know what offset we were on before.
      
      The only place we maintain the lock is searching delalloc.  Since the
      delalloc stuff can change during writeback we want to lock the extent
      range so we have a consistent view of delalloc at the time we're
      checking to see if we need to set the delalloc flag.
      
      With this patch applied we no longer deadlock with my testcase.
      
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ded566b4
    • Darrick J. Wong's avatar
      xfs: remove conditional building of rt geometry validator functions · ea01221f
      Darrick J. Wong authored
      commit 881f78f4 upstream.
      
      [backport: resolve merge conflicts due to refactoring rtbitmap/summary
      macros and accessors]
      
      I mistakenly turned off CONFIG_XFS_RT in the Kconfig file for arm64
      variant of the djwong-wtf git branch.  Unfortunately, it took me a good
      hour to figure out that RT wasn't built because this is what got printed
      to dmesg:
      
      XFS (sda2): realtime geometry sanity check failed
      XFS (sda2): Metadata corruption detected at xfs_sb_read_verify+0x170/0x190 [xfs], xfs_sb block 0x0
      
      Whereas I would have expected:
      
      XFS (sda2): Not built with CONFIG_XFS_RT
      XFS (sda2): RT mount failed
      
      The root cause of these problems is the conditional compilation of the
      new functions xfs_validate_rtextents and xfs_compute_rextslog that I
      introduced in the two commits listed below.  The !RT versions of these
      functions return false and 0, respectively, which causes primary
      superblock validation to fail, which explains the first message.
      
      Move the two functions to other parts of libxfs that are not
      conditionally defined by CONFIG_XFS_RT and remove the broken stubs so
      that validation works again.
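
      A rough sketch of why the stubs were harmful, and what unconditionally
      built helpers look like, is given below. The function names and the
      rt_geometry_ok_sketch() wrapper are hypothetical, and the specific rules
      used here (reject zero extents, rextslog equal to floor(log2) of the
      extent count) are assumptions for illustration rather than the exact
      libxfs logic. With stubs that always return false and 0, a check shaped
      like this one could never pass for a filesystem that has a realtime
      volume.

```c
#include <stdbool.h>
#include <stdint.h>

/* Built unconditionally: no config option selects a stub version. */
static bool validate_rtextents_sketch(uint64_t rtextents)
{
    /* Assumption: a realtime volume with zero extents is not valid. */
    return rtextents != 0;
}

/* Assumption: the log value is floor(log2(rtextents)), or 0 for 0. */
static uint8_t compute_rextslog_sketch(uint64_t rtextents)
{
    uint8_t log = 0;

    while (rtextents > 1) {
        rtextents >>= 1;
        log++;
    }
    return log;
}

/* Superblock-style check that combines both helpers. */
static bool rt_geometry_ok_sketch(uint64_t rtextents, uint8_t sb_rextslog)
{
    return validate_rtextents_sketch(rtextents) &&
           sb_rextslog == compute_rextslog_sketch(rtextents);
}
```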
      
      Fixes: e1429380 ("xfs: don't allow overly small or large realtime volumes")
      Fixes: a6a38f30 ("xfs: make rextslog computation consistent with mkfs")
      Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ea01221f
    • Andrey Albershteyn's avatar
      xfs: reset XFS_ATTR_INCOMPLETE filter on node removal · 9efd8426
      Andrey Albershteyn authored
      commit 82ef1a53 upstream.
      
      In XFS_DAS_NODE_REMOVE_ATTR case, xfs_attr_node_remove_attr() sets
      filter to XFS_ATTR_INCOMPLETE. The filter is then reset in
      xfs_attr_complete_op() if XFS_DA_OP_REPLACE operation is performed.
      
      The filter is not reset though if XFS just removes the attribute
      (args->value == NULL) with xfs_attr_defer_remove(). The attr code
      then goes to XFS_DAS_DONE state.

      Fix this by always resetting XFS_ATTR_INCOMPLETE filter. The replace
      operation already resets this filter anyway, and the others are
      completed at this step and hence don't need it.
      
      Fixes: fdaf1bb3 ("xfs: ATTR_REPLACE algorithm with LARP enabled needs rework")
      Signed-off-by: Andrey Albershteyn <aalbersh@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9efd8426
    • Zhang Tianci's avatar
      xfs: update dir3 leaf block metadata after swap · 69252ab1
      Zhang Tianci authored
      commit 5759aa4f upstream.
      
      xfs_da3_swap_lastblock() copies the content of the last block to the
      dead block, but does not update the metadata in it. Some block types
      need their metadata updated: a dir3 leafn block, for example, records
      its own blkno, which must be changed to the blkno of the dead block.
      Otherwise, before the xfs_buf is written to disk, verify_write() will
      fail on blk_hdr->blkno != xfs_buf->b_bn, and XFS will shut down.
      
      We will get this warning:
      
        XFS (dm-0): Metadata corruption detected at xfs_dir3_leaf_verify+0xa8/0xe0 [xfs], xfs_dir3_leafn block 0x178
        XFS (dm-0): Unmount and run xfs_repair
        XFS (dm-0): First 128 bytes of corrupted metadata buffer:
        00000000e80f1917: 00 80 00 0b 00 80 00 07 3d ff 00 00 00 00 00 00  ........=.......
        000000009604c005: 00 00 00 00 00 00 01 a0 00 00 00 00 00 00 00 00  ................
        000000006b6fb2bf: e4 44 e3 97 b5 64 44 41 8b 84 60 0e 50 43 d9 bf  .D...dDA..`.PC..
        00000000678978a2: 00 00 00 00 00 00 00 83 01 73 00 93 00 00 00 00  .........s......
        00000000b28b247c: 99 29 1d 38 00 00 00 00 99 29 1d 40 00 00 00 00  .).8.....).@....
        000000002b2a662c: 99 29 1d 48 00 00 00 00 99 49 11 00 00 00 00 00  .).H.....I......
        00000000ea2ffbb8: 99 49 11 08 00 00 45 25 99 49 11 10 00 00 48 fe  .I....E%.I....H.
        0000000069e86440: 99 49 11 18 00 00 4c 6b 99 49 11 20 00 00 4d 97  .I....Lk.I. ..M.
        XFS (dm-0): xfs_do_force_shutdown(0x8) called from line 1423 of file fs/xfs/xfs_buf.c.  Return address = 00000000c0ff63c1
        XFS (dm-0): Corruption of in-memory data detected.  Shutting down filesystem
        XFS (dm-0): Please umount the filesystem and rectify the problem(s)
      
      From the log above, we know xfs_buf->b_no is 0x178, but the block's
      header records its blkno as 0x1a0.
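
      The shape of the fix can be sketched in a few lines of C. This is a
      simplified model, not the XFS code: the structures, the 64-byte block
      size, and the function names are hypothetical. It only shows that a
      block which records its own location must have that field rewritten
      after its content is copied to a different location.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE_SKETCH 64

/* Hypothetical self-describing header: the block stores its own number. */
struct blk_hdr_sketch {
    uint64_t blkno;
};

struct block_sketch {
    struct blk_hdr_sketch hdr;
    uint8_t payload[BLOCK_SIZE_SKETCH - sizeof(struct blk_hdr_sketch)];
};

/* Copy the last block over the dead block, then fix up the recorded blkno. */
static void swap_lastblock_sketch(struct block_sketch *dead,
                                  uint64_t dead_blkno,
                                  const struct block_sketch *last)
{
    memcpy(dead, last, sizeof(*dead));
    dead->hdr.blkno = dead_blkno;  /* without this, the write verifier fails */
}

/* Models the write verifier: header blkno must match the buffer's blkno. */
static int verify_write_sketch(const struct block_sketch *blk,
                               uint64_t buf_blkno)
{
    return blk->hdr.blkno == buf_blkno;
}
```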
      
      Fixes: 24df33b4 ("xfs: add CRC checking to dir2 leaf blocks")
      Signed-off-by: Zhang Tianci <zhangtianci.1997@bytedance.com>
      Suggested-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      69252ab1
    • Jiachen Zhang's avatar
      xfs: ensure logflagsp is initialized in xfs_bmap_del_extent_real · 264e3509
      Jiachen Zhang authored
      commit e6af9c98 upstream.
      
      In the case of returning -ENOSPC, ensure logflagsp is initialized to 0.
      Otherwise the caller __xfs_bunmapi will write an uninitialized, illegal
      tmp_logflags value into the xfs log, which might cause unpredictable
      errors in the log recovery procedure.

      Also, remove the flags variable and set *logflagsp directly, so that
      the code is more robust in the long run.
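
      The pattern is the usual one for output parameters: give them a defined
      value before any early return. A minimal sketch, assuming a hypothetical
      del_extent_sketch() function and a made-up flag value:

```c
#include <errno.h>

#define LOG_CORE_SKETCH 0x1  /* hypothetical "inode core is dirty" flag */

/*
 * Writes *logflagsp directly and initializes it first, so the caller
 * sees a defined value on every return path, including the error one.
 */
static int del_extent_sketch(int out_of_space, int *logflagsp)
{
    *logflagsp = 0;

    if (out_of_space)
        return -ENOSPC;

    *logflagsp |= LOG_CORE_SKETCH;
    return 0;
}
```

      A caller that logs the flags unconditionally then logs 0 on the error
      path instead of whatever happened to be on its stack.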
      
      Fixes: 1b24b633 ("xfs: move some more code into xfs_bmap_del_extent_real")
      Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      264e3509
    • Long Li's avatar
      xfs: fix perag leak when growfs fails · 8a456679
      Long Li authored
      commit 78239218 upstream.
      
      During growfs, if the new AGs have been initialized in memory but
      sb_agcount has not yet been updated, an error at that point causes
      the perag leaks shown below. These new AGs will not be freed during
      umount because they are not visible (that is, not included in
      mp->m_sb.sb_agcount).
      
      unreferenced object 0xffff88810be40200 (size 512):
        comm "xfs_growfs", pid 857, jiffies 4294909093
        hex dump (first 32 bytes):
          00 c0 c1 05 81 88 ff ff 04 00 00 00 00 00 00 00  ................
          01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace (crc 381741e2):
          [<ffffffff8191aef6>] __kmalloc+0x386/0x4f0
          [<ffffffff82553e65>] kmem_alloc+0xb5/0x2f0
          [<ffffffff8238dac5>] xfs_initialize_perag+0xc5/0x810
          [<ffffffff824f679c>] xfs_growfs_data+0x9bc/0xbc0
          [<ffffffff8250b90e>] xfs_file_ioctl+0x5fe/0x14d0
          [<ffffffff81aa5194>] __x64_sys_ioctl+0x144/0x1c0
          [<ffffffff83c3d81f>] do_syscall_64+0x3f/0xe0
          [<ffffffff83e00087>] entry_SYSCALL_64_after_hwframe+0x62/0x6a
      unreferenced object 0xffff88810be40800 (size 512):
        comm "xfs_growfs", pid 857, jiffies 4294909093
        hex dump (first 32 bytes):
          20 00 00 00 00 00 00 00 57 ef be dc 00 00 00 00   .......W.......
          10 08 e4 0b 81 88 ff ff 10 08 e4 0b 81 88 ff ff  ................
        backtrace (crc bde50e2d):
          [<ffffffff8191b43a>] __kmalloc_node+0x3da/0x540
          [<ffffffff81814489>] kvmalloc_node+0x99/0x160
          [<ffffffff8286acff>] bucket_table_alloc.isra.0+0x5f/0x400
          [<ffffffff8286bdc5>] rhashtable_init+0x405/0x760
          [<ffffffff8238dda3>] xfs_initialize_perag+0x3a3/0x810
          [<ffffffff824f679c>] xfs_growfs_data+0x9bc/0xbc0
          [<ffffffff8250b90e>] xfs_file_ioctl+0x5fe/0x14d0
          [<ffffffff81aa5194>] __x64_sys_ioctl+0x144/0x1c0
          [<ffffffff83c3d81f>] do_syscall_64+0x3f/0xe0
          [<ffffffff83e00087>] entry_SYSCALL_64_after_hwframe+0x62/0x6a
      
      Factor out xfs_free_unused_perag_range() from xfs_initialize_perag(),
      used for freeing unused perag within a specified range in error handling,
      included in the error path of the growfs failure.
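
      The error-path cleanup can be modelled as below. This is a userspace
      sketch under stated assumptions: a fixed-size pointer table replaces the
      radix tree, and every name ending in _sketch is hypothetical. The point
      it illustrates is that objects created for AG numbers at or beyond the
      published count must be freed by the path that failed, because teardown
      only walks up to that count.

```c
#include <errno.h>
#include <stdlib.h>

#define MAX_AGS_SKETCH 16

struct perag_sketch {
    unsigned int agno;
};

/* Stand-in for the per-AG lookup structure. */
static struct perag_sketch *perag_table_sketch[MAX_AGS_SKETCH];

/* Free per-AG objects in [start, end) that were never made visible. */
static void free_unused_perag_range_sketch(unsigned int start,
                                           unsigned int end)
{
    for (unsigned int agno = start;
         agno < end && agno < MAX_AGS_SKETCH; agno++) {
        free(perag_table_sketch[agno]);
        perag_table_sketch[agno] = NULL;
    }
}

static int initialize_perag_sketch(unsigned int old_count,
                                   unsigned int new_count)
{
    if (new_count > MAX_AGS_SKETCH)
        return -EINVAL;

    for (unsigned int agno = old_count; agno < new_count; agno++) {
        struct perag_sketch *pag = calloc(1, sizeof(*pag));

        if (!pag) {
            free_unused_perag_range_sketch(old_count, agno);
            return -ENOMEM;
        }
        pag->agno = agno;
        perag_table_sketch[agno] = pag;
    }
    return 0;
}

/* later_step_error simulates a failure after the new AGs were set up. */
static int growfs_sketch(unsigned int *agcount, unsigned int new_count,
                         int later_step_error)
{
    int error = initialize_perag_sketch(*agcount, new_count);

    if (error)
        return error;
    if (later_step_error) {
        /* The count was not updated, so free the new range here. */
        free_unused_perag_range_sketch(*agcount, new_count);
        return later_step_error;
    }
    *agcount = new_count;
    return 0;
}
```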
      
      Fixes: 1c1c6ebc ("xfs: Replace per-ag array with a radix tree")
      Signed-off-by: Long Li <leo.lilong@huawei.com>
      Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8a456679
    • Long Li's avatar
      xfs: add lock protection when remove perag from radix tree · 59b115a7
      Long Li authored
      commit 07afd317 upstream.
      
      Take mp->m_perag_lock for deletions from the perag radix tree in
      xfs_initialize_perag to prevent racing with tagging operations.
      Lookups are fine - they are RCU protected so already deal with the
      tree changing shape underneath the lookup - but tagging operations
      require the tree to be stable while the tags are propagated back up
      to the root.
      
      Right now there's nothing stopping radix tree tagging from operating
      while a growfs operation is in progress and adding/removing new entries
      into the radix tree.
      
      Hence we can have traversals that require a stable tree occurring at
      the same time we are removing unused entries from the radix tree which
      causes the shape of the tree to change.
      
      Likely this hasn't caused a problem in the past because we are only
      doing append addition and removal so the active AG part of the tree
      is not changing shape, but that doesn't mean it is safe. Just making
      the radix tree modifications serialise against each other is obviously
      correct.
      
      Signed-off-by: Long Li <leo.lilong@huawei.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      59b115a7
    • Eric Sandeen's avatar
      xfs: short circuit xfs_growfs_data_private() if delta is zero · c4848932
      Eric Sandeen authored
      commit 84712492 upstream.
      
      Although xfs_growfs_data() doesn't call xfs_growfs_data_private()
      if in->newblocks == mp->m_sb.sb_dblocks, xfs_growfs_data_private()
      further massages the new block count so that we don't, e.g., try
      to create a too-small new AG.
      
      This may lead to a delta of "0" in xfs_growfs_data_private(), so
      we end up in the shrink case and emit the EXPERIMENTAL warning
      even if we're not changing anything at all.
      
      Fix this by returning straightaway if the block delta is zero.
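
      The control flow can be sketched as follows. The enum and function names
      are hypothetical, and the inputs are assumed to be the current block
      count and the new block count after any rounding has been applied. The
      zero case is tested first, so it can no longer fall through to the
      shrink branch.

```c
#include <stdint.h>

enum growfs_action_sketch {
    GROWFS_NOOP_SKETCH,
    GROWFS_GROW_SKETCH,
    GROWFS_SHRINK_SKETCH
};

/* Decide what to do from the block delta, returning early on zero. */
static enum growfs_action_sketch
growfs_classify_sketch(uint64_t cur_blocks, uint64_t adjusted_new_blocks)
{
    int64_t delta = (int64_t)adjusted_new_blocks - (int64_t)cur_blocks;

    if (delta == 0)
        return GROWFS_NOOP_SKETCH;  /* nothing changes: no warning, no work */

    return delta > 0 ? GROWFS_GROW_SKETCH : GROWFS_SHRINK_SKETCH;
}
```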
      
      (nb: in older kernels, the result of entering the shrink case
      with delta == 0 may actually let an -ENOSPC escape to userspace,
      which is confusing for users.)
      
      Fixes: fb2fc172 ("xfs: support shrinking unused space in the last AG")
      Signed-off-by: Eric Sandeen <sandeen@redhat.com>
      Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c4848932
    • Dave Chinner's avatar
      xfs: initialise di_crc in xfs_log_dinode · 47604cf2
      Dave Chinner authored
      commit 0573676f upstream.
      
      Alexander Potapenko reported that KMSAN was issuing these warnings:
      
      kmalloc-ed xlog buffer of size 512 : ffff88802fc26200
      kmalloc-ed xlog buffer of size 368 : ffff88802fc24a00
      kmalloc-ed xlog buffer of size 648 : ffff88802b631000
      kmalloc-ed xlog buffer of size 648 : ffff88802b632800
      kmalloc-ed xlog buffer of size 648 : ffff88802b631c00
      xlog_write_iovec: copying 12 bytes from ffff888017ddbbd8 to ffff88802c300400
      xlog_write_iovec: copying 28 bytes from ffff888017ddbbe4 to ffff88802c30040c
      xlog_write_iovec: copying 68 bytes from ffff88802fc26274 to ffff88802c300428
      xlog_write_iovec: copying 188 bytes from ffff88802fc262bc to ffff88802c30046c
      =====================================================
      BUG: KMSAN: uninit-value in xlog_write_iovec fs/xfs/xfs_log.c:2227
      BUG: KMSAN: uninit-value in xlog_write_full fs/xfs/xfs_log.c:2263
      BUG: KMSAN: uninit-value in xlog_write+0x1fac/0x2600 fs/xfs/xfs_log.c:2532
       xlog_write_iovec fs/xfs/xfs_log.c:2227
       xlog_write_full fs/xfs/xfs_log.c:2263
       xlog_write+0x1fac/0x2600 fs/xfs/xfs_log.c:2532
       xlog_cil_write_chain fs/xfs/xfs_log_cil.c:918
       xlog_cil_push_work+0x30f2/0x44e0 fs/xfs/xfs_log_cil.c:1263
       process_one_work kernel/workqueue.c:2630
       process_scheduled_works+0x1188/0x1e30 kernel/workqueue.c:2703
       worker_thread+0xee5/0x14f0 kernel/workqueue.c:2784
       kthread+0x391/0x500 kernel/kthread.c:388
       ret_from_fork+0x66/0x80 arch/x86/kernel/process.c:147
       ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
      
      Uninit was created at:
       slab_post_alloc_hook+0x101/0xac0 mm/slab.h:768
       slab_alloc_node mm/slub.c:3482
       __kmem_cache_alloc_node+0x612/0xae0 mm/slub.c:3521
       __do_kmalloc_node mm/slab_common.c:1006
       __kmalloc+0x11a/0x410 mm/slab_common.c:1020
       kmalloc ./include/linux/slab.h:604
       xlog_kvmalloc fs/xfs/xfs_log_priv.h:704
       xlog_cil_alloc_shadow_bufs fs/xfs/xfs_log_cil.c:343
       xlog_cil_commit+0x487/0x4dc0 fs/xfs/xfs_log_cil.c:1574
       __xfs_trans_commit+0x8df/0x1930 fs/xfs/xfs_trans.c:1017
       xfs_trans_commit+0x30/0x40 fs/xfs/xfs_trans.c:1061
       xfs_create+0x15af/0x2150 fs/xfs/xfs_inode.c:1076
       xfs_generic_create+0x4cd/0x1550 fs/xfs/xfs_iops.c:199
       xfs_vn_create+0x4a/0x60 fs/xfs/xfs_iops.c:275
       lookup_open fs/namei.c:3477
       open_last_lookups fs/namei.c:3546
       path_openat+0x29ac/0x6180 fs/namei.c:3776
       do_filp_open+0x24d/0x680 fs/namei.c:3809
       do_sys_openat2+0x1bc/0x330 fs/open.c:1440
       do_sys_open fs/open.c:1455
       __do_sys_openat fs/open.c:1471
       __se_sys_openat fs/open.c:1466
       __x64_sys_openat+0x253/0x330 fs/open.c:1466
       do_syscall_x64 arch/x86/entry/common.c:51
       do_syscall_64+0x4f/0x140 arch/x86/entry/common.c:82
       entry_SYSCALL_64_after_hwframe+0x63/0x6b arch/x86/entry/entry_64.S:120
      
      Bytes 112-115 of 188 are uninitialized
      Memory access of size 188 starts at ffff88802fc262bc
      
      This is caused by the struct xfs_log_dinode not having the di_crc
      field initialised. Log recovery never uses this field (it is only
      present these days for on-disk format compatibility reasons) and so
      its value is never checked, so nothing in XFS has caught this.
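
      A minimal illustration of this class of problem, using a hypothetical
      three-field structure rather than the real struct xfs_log_dinode: a
      formatter that fills a freshly allocated buffer field by field has to
      assign every field, including ones that no reader consumes, or stale
      heap bytes get copied along with the rest.

```c
#include <stdint.h>

/* Hypothetical log-format structure; the real one has many more fields. */
struct log_dinode_sketch {
    uint32_t magic;
    uint32_t crc;   /* unused by readers, kept for format compatibility */
    uint64_t ino;
};

/* Fill every field of *to, which may point at uninitialised memory. */
static void format_log_dinode_sketch(struct log_dinode_sketch *to,
                                     uint64_t ino)
{
    to->magic = 0x494e;  /* arbitrary example value */
    to->ino = ino;
    to->crc = 0;         /* explicitly defined, even though never read */
}
```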
      
      Further, none of the uninitialised memory access warning tools have
      caught this (despite catching other uninit memory accesses in the
      struct xfs_log_dinode back in 2017!) until recently. Alexander
      annotated the XFS code to get the dump of the actual bytes that were
      detected as uninitialised, and from that report it took me about 30s
      to realise what the issue was.
      
      The issue was introduced back in 2016 and every inode that is logged
      fails to initialise this field. There is no actual bad behaviour
      caused by this issue - I find it hard to even classify it as a
      bug...
      
      Reported-and-tested-by: Alexander Potapenko <glider@google.com>
      Fixes: f8d55aa0 ("xfs: introduce inode log format object")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      47604cf2
    • Darrick J. Wong's avatar
      xfs: add missing nrext64 inode flag check to scrub · b9358db0
      Darrick J. Wong authored
      commit 576d30ec upstream.
      
      Add this missing check that the superblock nrext64 flag is set if the
      inode flag is set.
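
      Expressed as a predicate, the check is a one-way implication: the inode
      flag requires the superblock feature, but not the reverse. The function
      below is a hypothetical sketch of that rule, not the scrub code itself.

```c
#include <stdbool.h>

/* An inode may carry the nrext64 flag only if the superblock enables it. */
static bool nrext64_flags_consistent_sketch(bool sb_has_nrext64,
                                            bool inode_has_nrext64)
{
    return !inode_has_nrext64 || sb_has_nrext64;
}
```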
      
      Fixes: 9b7d16e3 ("xfs: Introduce XFS_DIFLAG2_NREXT64 and associated helpers")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b9358db0
    • Darrick J. Wong's avatar
      xfs: force all buffers to be written during btree bulk load · 1a48327c
      Darrick J. Wong authored
      commit 13ae04d8 upstream.
      
      While stress-testing online repair of btrees, I noticed periodic
      assertion failures from the buffer cache about buffers with incorrect
      DELWRI_Q state.  Looking further, I observed this race between the AIL
      trying to write out a btree block and repair zapping a btree block after
      the fact:
      
      AIL:    Repair0:
      
      pin buffer X
      delwri_queue:
      set DELWRI_Q
      add to delwri list
      
              stale buf X:
              clear DELWRI_Q
              does not clear b_list
              free space X
              commit
      
      delwri_submit   # oops
      
      Worse yet, I discovered that running the same repair over and over in a
      tight loop can result in a second race that causes data integrity
      problems with the repair:
      
      AIL:    Repair0:        Repair1:
      
      pin buffer X
      delwri_queue:
      set DELWRI_Q
      add to delwri list
      
              stale buf X:
              clear DELWRI_Q
              does not clear b_list
              free space X
              commit
      
                              find free space X
                              get buffer
                              rewrite buffer
                              delwri_queue:
                              set DELWRI_Q
                              already on a list, do not add
                              commit
      
                              BAD: committed tree root before all blocks written
      
      delwri_submit   # too late now
      
      I traced this to my own misunderstanding of how the delwri lists work,
      particularly with regards to the AIL's buffer list.  If a buffer is
      logged and committed, the buffer can end up on that AIL buffer list.  If
      btree repairs are run twice in rapid succession, it's possible that the
      first repair will invalidate the buffer and free it before the next time
      the AIL wakes up.  Marking the buffer stale clears DELWRI_Q from the
      buffer state without removing the buffer from its delwri list.  The
      buffer doesn't know which list it's on, so it cannot know which lock to
      take to protect the list for a removal.
      
      If the second repair allocates the same block, it will then recycle the
      buffer to start writing the new btree block.  Meanwhile, if the AIL
      wakes up and walks the buffer list, it will ignore the buffer because it
      can't lock it, and go back to sleep.
      
      When the second repair calls delwri_queue to put the buffer on the
      list of buffers to write before committing the new btree, it will set
      DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
      buffer list, it won't add it to the bulkload buffer's list.
      
      This is incorrect, because the bulkload caller relies on delwri_submit
      to ensure that all the buffers have been sent to disk /before/
      committing the new btree root pointer.  This ordering is required for
      data consistency.
      
      Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
      drop it, so the next thread to walk through the btree will trip over a
      debug assertion on that flag.
      
      To fix this, create a new function that waits for the buffer to be
      removed from any other delwri lists before adding the buffer to the
      caller's delwri list.  By waiting for the buffer to clear both the
      delwri list and any potential delwri wait list, we can be sure that
      repair will initiate writes of all buffers and report all write errors
      back to userspace instead of committing the new structure.
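
      The sequence above can be illustrated with a small userspace model.
      This is only a sketch under stated assumptions: every model_* name
      below is hypothetical and is not the kernel's API, the "lists" are
      plain counters, and where the real fix waits for the buffer to leave
      the other list, the model simply returns false so the caller can retry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; these are not the kernel's types. */
struct model_list {
	int nr_bufs;			/* buffers currently on this list */
};

struct model_buf {
	bool delwri_q;			/* models the DELWRI_Q flag */
	struct model_list *on_list;	/* models b_list membership */
};

/*
 * Models delwri_queue: always sets the flag, but never moves a buffer that
 * is already on some list.  Returns true only if the buffer ends up on the
 * caller's list.
 */
static bool model_delwri_queue(struct model_buf *bp, struct model_list *list)
{
	bp->delwri_q = true;
	if (bp->on_list != NULL)
		return bp->on_list == list;	/* already on a list, do not add */
	bp->on_list = list;
	list->nr_bufs++;
	return true;
}

/* Models staling: clears the flag but leaves list membership untouched. */
static void model_stale(struct model_buf *bp)
{
	bp->delwri_q = false;
}

/* Models the old list owner finally dropping the buffer from its list. */
static void model_drop_from_list(struct model_buf *bp)
{
	if (bp->on_list != NULL) {
		bp->on_list->nr_bufs--;
		bp->on_list = NULL;
	}
}

/*
 * Models the idea of the fix: do not queue until the buffer is off every
 * other list.  A real implementation would sleep here; this model returns
 * false to tell the caller to wait and retry.
 */
static bool model_delwri_queue_here(struct model_buf *bp,
				    struct model_list *list)
{
	if (bp->on_list != NULL)
		return false;
	return model_delwri_queue(bp, list);
}
```

      In this model, queueing a buffer on one list, staling it, and then
      calling model_delwri_queue() with a second list leaves the flag set
      while the second list stays empty, which mirrors the second race.
      model_delwri_queue_here() refuses until model_drop_from_list() has run.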
      
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Darrick J. Wong's avatar
      xfs: fix an off-by-one error in xreap_agextent_binval · 7bc086bb
      Darrick J. Wong authored
      commit c0e37f07 upstream.
      
      Overall, this function tries to find and invalidate all buffers for a
      given extent of space on the data device.  The inner for loop in this
      function tries to find all xfs_bufs for a given daddr.  The lengths of
      all possible cached buffers range from 1 fsblock to the largest needed
      to contain a 64k xattr value (~17fsb).  The scan is capped to avoid
      looking at any buffer going past the given extent.
      
      Unfortunately, the loop continuation test is wrong -- max_fsbs is the
      largest size we want to scan, not one past that.  Put another way, this
      loop is actually 1-indexed, not 0-indexed.  Therefore, the continuation
      test should use <=, not <.
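
      The effect of the continuation test can be seen in a minimal
      standalone sketch.  The helper name count_lengths_scanned() and its
      "inclusive" switch are hypothetical and exist only for illustration;
      the loop variable names follow the description above.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model: count how many candidate buffer lengths (in fsblocks)
 * a 1-indexed scan visits for a given cap.  "inclusive" selects between the
 * corrected test (<=) and the original test (<).
 */
static unsigned int count_lengths_scanned(unsigned int max_fsbs,
					  bool inclusive)
{
	unsigned int fsbcount;
	unsigned int scanned = 0;

	for (fsbcount = 1;
	     inclusive ? fsbcount <= max_fsbs : fsbcount < max_fsbs;
	     fsbcount++)
		scanned++;	/* stands in for "look up a buffer this long" */

	return scanned;
}
```

      With a cap of 1 fsblock, the exclusive test visits no lengths at all,
      while the inclusive test visits exactly one.  For any cap, the
      exclusive form skips the largest length.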
      
      As a result, online repairs of btree blocks fail to stale any buffers
      for btrees that are being torn down, which causes later assertions in
      the buffer cache when another thread creates a different-sized buffer.
      This happens in xfs/709 when allocating an inode cluster buffer:
      
       ------------[ cut here ]------------
       WARNING: CPU: 0 PID: 3346128 at fs/xfs/xfs_message.c:104 assfail+0x3a/0x40 [xfs]
       CPU: 0 PID: 3346128 Comm: fsstress Not tainted 6.7.0-rc4-djwx #rc4
       RIP: 0010:assfail+0x3a/0x40 [xfs]
       Call Trace:
        <TASK>
        _xfs_buf_obj_cmp+0x4a/0x50
        xfs_buf_get_map+0x191/0xba0
        xfs_trans_get_buf_map+0x136/0x280
        xfs_ialloc_inode_init+0x186/0x340
        xfs_ialloc_ag_alloc+0x254/0x720
        xfs_dialloc+0x21f/0x870
        xfs_create_tmpfile+0x1a9/0x2f0
        xfs_rename+0x369/0xfd0
        xfs_vn_rename+0xfa/0x170
        vfs_rename+0x5fb/0xc30
        do_renameat2+0x52d/0x6e0
        __x64_sys_renameat2+0x4b/0x60
        do_syscall_64+0x3b/0xe0
        entry_SYSCALL_64_after_hwframe+0x46/0x4e
      
      A later refactoring patch in the online repair series fixed this by
      accident, which is why I didn't notice this until I started testing only
      the patches that are likely to end up in 6.8.
      
      Fixes: 1c7ce115 ("xfs: reap large AG metadata extents when possible")
      Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
      Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Darrick J. Wong's avatar
      xfs: recompute growfsrtfree transaction reservation while growing rt volume · 84cd4f79
      Darrick J. Wong authored
      commit 578bd4ce upstream.
      
      While playing with growfs to create a 20TB realtime section on a
      filesystem that didn't previously have an rt section, I noticed that
      growfs would occasionally shut down the log due to a transaction
      reservation overflow.
      
      xfs_calc_growrtfree_reservation uses the current size of the realtime
      summary file (m_rsumsize) to compute the transaction reservation for a
      growrtfree transaction.  The reservations are computed at mount time,
      which means that m_rsumsize is zero when growfs starts "freeing" the new
      realtime extents into the rt volume.  As a result, the transaction is
      undersized and fails.
      
      Fix this by recomputing the transaction reservations every time we
      change m_rsumsize.
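
      The stale-cache problem can be sketched in a few lines of userspace
      C.  Everything here is an assumption made for illustration: the
      model_* names are hypothetical, and the reservation formula (a fixed
      overhead plus the summary file size) is made up and is not the
      formula the kernel uses.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a mount that caches a transaction reservation. */
struct model_mount {
	uint64_t rsumsize;		/* models m_rsumsize, in bytes */
	uint64_t growrtfree_res;	/* cached reservation, in bytes */
};

#define MODEL_FIXED_OVERHEAD	4096u	/* made-up constant */

/* Made-up formula: the reservation grows with the summary file size. */
static uint64_t model_calc_growrtfree_res(const struct model_mount *mp)
{
	return MODEL_FIXED_OVERHEAD + mp->rsumsize;
}

/* Models the mount-time computation that fills in the cached value. */
static void model_recompute_reservations(struct model_mount *mp)
{
	mp->growrtfree_res = model_calc_growrtfree_res(mp);
}

/* Models the fix: every change to rsumsize refreshes the cached value. */
static void model_set_rsumsize(struct model_mount *mp, uint64_t new_rsumsize)
{
	mp->rsumsize = new_rsumsize;
	model_recompute_reservations(mp);
}
```

      If rsumsize is changed directly after the initial computation, the
      cached value stays at the size computed for an empty summary file and
      is too small.  Routing the change through model_set_rsumsize() keeps
      the two in step.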
      
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Darrick J. Wong's avatar
      xfs: remove unused fields from struct xbtree_ifakeroot · d6b65ed1
      Darrick J. Wong authored
      commit 4c8ecd1c upstream.
      
      Remove these fields since nobody uses them.  They should have
      been removed years ago in a different cleanup series from Christoph
      Hellwig.
      
      Fixes: daf83964 ("xfs: move the per-fork nextents fields into struct xfs_ifork")
      Fixes: f7e67b20 ("xfs: move the fork format fields into struct xfs_ifork")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com>
      Acked-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>