Commit 8cb1ae19 authored Nov 01, 2021 by Linus Torvalds

Merge tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fpu updates from Thomas Gleixner:

 - Cleanup of extable fixup handling to be more robust, which in turn
   allows the FPU exception fixups to be made more robust as well.

 - Change the return code for signal frame related failures from
   explicit error codes to a boolean fail/success as that is all the
   calling code evaluates.

 - A large refactoring of the FPU code to prepare for adding AMX
   support:

      - Disentangle the public header maze and especially remove the
        misnamed kitchen sink internal.h which is, despite its name,
        included all over the place.

      - Add a proper abstraction for the register buffer storage (struct
        fpstate) which allows the buffer to be sized dynamically at
        runtime by flipping the pointer to the buffer container from the
        default container which is embedded in task_struct::thread::fpu
        to a dynamically allocated container with a larger register
        buffer.

      - Convert the code over to the new fpstate mechanism.

      - Consolidate the KVM FPU handling by moving the FPU related code
        into the FPU core which reduces the number of exports and avoids
        adding even more exports when AMX has to be supported in KVM.
        This also removes duplicated code which was of course
        unnecessarily different and incomplete in the KVM copy.

      - Simplify the KVM FPU buffer handling by utilizing the new
        fpstate container and just switching the buffer pointer from the
        user space buffer to the KVM guest buffer when entering
        vcpu_run() and flipping it back when leaving the function. This
        cuts the memory requirements of a vCPU for FPU buffers in half
        and avoids pointless memory copy operations.

        This also solves the so far unresolved problem of adding AMX
        support because the current FPU buffer handling of KVM inflicted
        a circular dependency between adding AMX support to the core and
        to KVM. With the new scheme of switching fpstate AMX support can
        be added to the core code without affecting KVM.

      - Replace various variables with proper data structures so the
        extra information required for adding dynamically enabled FPU
        features (AMX) can be added in one place
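
The pointer flip described in the fpstate items above can be sketched in
user space C as follows. This is only an illustration under stated
assumptions: struct fpstate_sketch, struct fpu_sketch, fpu_swap_fpstate()
and run_guest_once() are hypothetical names, the buffer is tiny, and none
of this is the kernel's actual code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; not the kernel's definitions. */
struct fpstate_sketch {
	size_t size;             /* usable bytes in regs */
	unsigned char regs[64];  /* register buffer (tiny, for illustration) */
};

struct fpu_sketch {
	struct fpstate_sketch *fpstate;         /* points at the active container */
	struct fpstate_sketch  default_fpstate; /* default container, embedded */
};

/* Start out with the embedded default container as the active one. */
static void fpu_init(struct fpu_sketch *fpu)
{
	memset(fpu, 0, sizeof(*fpu));
	fpu->default_fpstate.size = sizeof(fpu->default_fpstate.regs);
	fpu->fpstate = &fpu->default_fpstate;
}

/* Make 'next' the active container and return the previous one. */
static struct fpstate_sketch *fpu_swap_fpstate(struct fpu_sketch *fpu,
					       struct fpstate_sketch *next)
{
	struct fpstate_sketch *prev = fpu->fpstate;

	fpu->fpstate = next;
	return prev;
}

/* Models one pass through a vcpu_run()-like function. */
static void run_guest_once(struct fpu_sketch *fpu,
			   struct fpstate_sketch *guest)
{
	/* Entry: switch from the user space buffer to the guest buffer. */
	struct fpstate_sketch *user = fpu_swap_fpstate(fpu, guest);

	/* Guest activity only touches the currently active buffer. */
	fpu->fpstate->regs[0] = 0x42;

	/* Exit: flip back. No register data is copied in either direction. */
	fpu_swap_fpstate(fpu, user);
}
```

Only the pointer changes hands in this sketch, which is the property the
text credits with halving the per-vCPU buffer memory and avoiding copies.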

 - Add AMX (Advanced Matrix eXtensions) support (finally):

   AMX is a large XSTATE component which is going to be available with
   Sapphire Rapids XEON CPUs. The feature comes with an extra MSR
   (MSR_XFD) which allows trapping the (first) use of an AMX related
   instruction, which has two benefits:

    1) It allows the kernel to control access to the feature

    2) It allows the kernel to dynamically allocate the large register
       state buffer instead of burdening every task with the extra
       8K or larger state storage.

   It would have been great to gain this kind of control already with
   AVX512.

   The support comes with the following infrastructure components:

    1) arch_prctl() to
        - read the supported features (equivalent to XGETBV(0))
        - read the permitted features for a task
        - request permission for a dynamically enabled feature

       Permission is granted per process, inherited on fork() and
       cleared on exec(). The permission policy of the kernel is
       restricted to sigaltstack size validation, but the syscall
       obviously allows further restrictions via seccomp etc.

    2) A stronger sigaltstack size validation for sys_sigaltstack(2)
       which takes granted permissions and the potentially resulting
       larger signal frame into account. This mechanism can also be used
       to enforce factual sigaltstack validation independent of dynamic
       features to help with finding potential victims of the 2K
       sigaltstack size constant which has been broken since AVX512
       support was added.

    3) Exception handling for #NM traps to catch first use of an extended
       feature via a new cause MSR. If the exception was caused by the
       use of such a feature, the handler checks permission for that
       feature. If permission has not been granted, the handler sends a
       SIGILL like the #UD handler would do if the feature had been
       disabled in XCR0. If permission has been granted, then a new
       fpstate which fits the larger buffer requirement is allocated.

       In the unlikely case that this allocation fails, the handler
       sends SIGSEGV to the task. That's not elegant, but unavoidable as
       the other discussed options of preallocation or full per task
       permissions come with their own set of horrors for kernel and/or
       userspace. So this is the lesser of the evils and SIGSEGV caused
       by unexpected memory allocation failures is not a fundamentally
       new concept either.

       When allocation succeeds, the fpstate properties are filled in to
       reflect the extended feature set and the resulting sizes, the
       fpu::fpstate pointer is updated accordingly and the trap is
       disarmed for this task permanently.

    4) Enumeration and size calculations

    5) Trap switching via MSR_XFD

       The XFD (eXtended Feature Disable) MSR is context switched with
       the same life time rules as the FPU register state itself. The
       mechanism is keyed off a static key which is disabled by default,
       so CPUs without AMX have zero overhead. On AMX enabled CPUs the
       overhead is limited by comparing the task's XFD value with a per
       CPU shadow variable to avoid redundant MSR writes. When switching
       from an AMX using task to a non AMX using task or vice versa, the
       extra MSR write is obviously inevitable.

       All other places which need to be aware of the variable feature
       sets and resulting variable sizes are not affected at all because
       they retrieve the information (feature set, sizes) unconditionally
       from the fpstate properties.

    6) Enable the new AMX states
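
The shadow comparison described in 5) can be modelled in plain C as shown
here. This is a user space simulation, not kernel code: xfd_shadow,
msr_writes, write_xfd_msr() and xfd_switch_to() are hypothetical names,
the MSR write is replaced by a counter, and the static key which skips the
whole path on CPUs without XFD is left out.

```c
#include <stdint.h>

/* Hypothetical stand-ins for illustration only. */
static uint64_t xfd_shadow;      /* models the per CPU shadow of MSR_XFD */
static unsigned int msr_writes;  /* counts simulated MSR writes */

/* Models the expensive MSR write; here it only counts invocations. */
static void write_xfd_msr(uint64_t value)
{
	(void)value;
	msr_writes++;
}

/*
 * Called when switching to a task: write the MSR only if the incoming
 * task's XFD value differs from what this CPU already has programmed.
 */
static void xfd_switch_to(uint64_t task_xfd)
{
	if (xfd_shadow != task_xfd) {
		write_xfd_msr(task_xfd);
		xfd_shadow = task_xfd;
	}
}
```

Switching between two tasks with the same XFD value costs one comparison
in this model. Only a switch between tasks with different values, for
example from a task with the trap still armed to one which already uses
AMX, reaches the write.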

   Note, this is relatively new code despite the fact that AMX support
   has been in the works for more than a year now.

   The big refactoring of the FPU code, which allowed a proper
   integration, was started exactly 3 weeks ago. Refactoring of the
   existing FPU code and of the original AMX patches took a week and has
   been subject to extensive review and testing. The only fallout which
   has not been caught in review and testing right away was restricted
   to AMX enabled systems, which is completely irrelevant for anyone
   outside Intel and their early access program. There might be dragons
   lurking as usual, but so far the fine grained refactoring has held up
   and any as yet undetected fallout is bisectable and should be
   easily addressable before the 5.16 release. Famous last words...

   Many thanks to Chang Bae and Dave Hansen for working hard on this and
   also to the various test teams at Intel who reserved extra capacity
   to follow the rapid development of this closely which provides the
   confidence level required to offer this rather large update for
   inclusion into 5.16-rc1

* tag 'x86-fpu-2021-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (110 commits)
  Documentation/x86: Add documentation for using dynamic XSTATE features
  x86/fpu: Include vmalloc.h for vzalloc()
  selftests/x86/amx: Add context switch test
  selftests/x86/amx: Add test cases for AMX state management
  x86/fpu/amx: Enable the AMX feature in 64-bit mode
  x86/fpu: Add XFD handling for dynamic states
  x86/fpu: Calculate the default sizes independently
  x86/fpu/amx: Define AMX state components and have it used for boot-time checks
  x86/fpu/xstate: Prepare XSAVE feature table for gaps in state component numbers
  x86/fpu/xstate: Add fpstate_realloc()/free()
  x86/fpu/xstate: Add XFD #NM handler
  x86/fpu: Update XFD state where required
  x86/fpu: Add sanity checks for XFD
  x86/fpu: Add XFD state to fpstate
  x86/msr-index: Add MSRs for XFD
  x86/cpufeatures: Add eXtended Feature Disabling (XFD) feature bit
  x86/fpu: Reset permission and fpstate on exec()
  x86/fpu: Prepare fpu_clone() for dynamically enabled features
  x86/fpu/signal: Prepare for variable sigframe length
  x86/signal: Use fpu::__state_user_size for sigalt stack validation
  ...

parents 7d20dd32 d7a9590f

Documentation/admin-guide/kernel-parameters.txt

+9 −0

Original line number	Diff line number	Diff line
		@@ -5497,6 +5497,15 @@
		stifb= [HW]
		Format: bpp:<bpp1>[:<bpp2>[:<bpp3>...]]

		strict_sas_size=
		[X86]
		Format: <bool>
		Enable or disable strict sigaltstack size checks
		against the required signal frame size which
		depends on the supported FPU features. This can
		be used to filter out binaries which have
		not yet been made aware of AT_MINSIGSTKSZ.

		sunrpc.min_resvport=
		sunrpc.max_resvport=
		[NFS,SUNRPC]

Documentation/x86/index.rst

+1 −0

		@@ -37,3 +37,4 @@ x86-specific Documentation
		sgx
		features
		elf_auxvec
		xstate

Documentation/x86/xstate.rst

0 → 100644

+65 −0

		Using XSTATE features in user space applications
		================================================

		The x86 architecture supports floating-point extensions which are
		enumerated via CPUID. Applications consult CPUID and use XGETBV to
		evaluate which features have been enabled by the kernel in XCR0.

		Up to AVX-512 and PKRU states, these features are automatically enabled by
		the kernel if available. Features like AMX TILE_DATA (XSTATE component 18)
		are enabled by XCR0 as well, but the first use of a related instruction is
		trapped by the kernel because by default the required large XSTATE buffers
		are not allocated automatically.

		Using dynamically enabled XSTATE features in user space applications
		--------------------------------------------------------------------

		The kernel provides an arch_prctl(2) based mechanism for applications to
		request the usage of such features. The arch_prctl(2) options related to
		this are:

		-ARCH_GET_XCOMP_SUPP

		arch_prctl(ARCH_GET_XCOMP_SUPP, &features);

		ARCH_GET_XCOMP_SUPP stores the supported features in userspace storage of
		type uint64_t. The second argument is a pointer to that storage.

		-ARCH_GET_XCOMP_PERM

		arch_prctl(ARCH_GET_XCOMP_PERM, &features);

		ARCH_GET_XCOMP_PERM stores the features for which the userspace process
		has permission in userspace storage of type uint64_t. The second argument
		is a pointer to that storage.

		-ARCH_REQ_XCOMP_PERM

		arch_prctl(ARCH_REQ_XCOMP_PERM, feature_nr);

		ARCH_REQ_XCOMP_PERM requests permission for a dynamically enabled
		feature or a feature set. A feature set can be mapped to a facility, e.g.
		AMX, and can require one or more XSTATE components to be enabled.

		The feature argument is the number of the highest XSTATE component which
		is required for a facility to work.
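
The three options above could be combined as in the following sketch. It
makes several assumptions: the option values are repeated from the
kernel's asm/prctl.h as far as known in case the installed headers predate
this change and should be checked against your headers, the call goes
through syscall(2) instead of a libc wrapper, and xcomp_call(),
xfeature_bit() and request_amx_tile_data() are hypothetical helper names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed values; verify against asm/prctl.h of your kernel headers. */
#ifndef ARCH_GET_XCOMP_SUPP
#define ARCH_GET_XCOMP_SUPP 0x1021
#define ARCH_GET_XCOMP_PERM 0x1022
#define ARCH_REQ_XCOMP_PERM 0x1023
#endif

/* AMX TILE_DATA is XSTATE component 18. */
#define XFEATURE_XTILE_DATA 18

static uint64_t xfeature_bit(unsigned int nr)
{
	return (uint64_t)1 << nr;
}

/* Thin wrapper; fails with ENOSYS where arch_prctl does not exist. */
static int xcomp_call(int option, unsigned long arg)
{
#ifdef SYS_arch_prctl
	return (int)syscall(SYS_arch_prctl, option, arg);
#else
	(void)option;
	(void)arg;
	errno = ENOSYS;
	return -1;
#endif
}

/* Returns 1 if permission for AMX TILE_DATA is held afterwards, else 0. */
static int request_amx_tile_data(void)
{
	uint64_t supported = 0, permitted = 0;
	uint64_t bit = xfeature_bit(XFEATURE_XTILE_DATA);

	/* Is the feature supported at all? */
	if (xcomp_call(ARCH_GET_XCOMP_SUPP, (unsigned long)&supported))
		return 0;
	if (!(supported & bit))
		return 0;

	/* Request permission; the argument is the feature number. */
	if (xcomp_call(ARCH_REQ_XCOMP_PERM, XFEATURE_XTILE_DATA))
		return 0;

	/* Read back the permitted set to confirm. */
	if (xcomp_call(ARCH_GET_XCOMP_PERM, (unsigned long)&permitted))
		return 0;
	return (permitted & bit) != 0;
}
```

An application would call request_amx_tile_data() once, before executing
the first AMX instruction, and fall back to a non-AMX code path when it
returns 0.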

		When requesting permission for a feature, the kernel checks the
		availability. The kernel ensures that sigaltstacks in the process's tasks
		are large enough to accommodate the resulting large signal frame. It
		enforces this both during ARCH_REQ_XCOMP_PERM and during any subsequent
		sigaltstack(2) calls. If an installed sigaltstack is smaller than the
		resulting sigframe size, ARCH_REQ_XCOMP_PERM results in -ENOSUPP. Also,
		sigaltstack(2) results in -ENOMEM if the requested altstack is too small
		for the permitted features.
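
To stay clear of these size checks, an application can size its alternate
signal stack from the value the kernel reports via AT_MINSIGSTKSZ instead
of relying on the MINSIGSTKSZ constant. The following is a minimal sketch
under assumptions: min_sigstack_size() and install_sigaltstack() are
hypothetical helper names, the fallback value 51 for AT_MINSIGSTKSZ is the
auxv key as far as known and should be checked against your headers, and
the headroom is the caller's own estimate of what its handlers need.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/auxv.h>

/* Assumed auxv key; only used if the installed headers lack it. */
#ifndef AT_MINSIGSTKSZ
#define AT_MINSIGSTKSZ 51
#endif

/*
 * Minimum stack size to plan for: the kernel-reported value if it is
 * present and larger, otherwise the legacy MINSIGSTKSZ.
 */
static size_t min_sigstack_size(void)
{
	unsigned long from_kernel = getauxval(AT_MINSIGSTKSZ);
	size_t legacy = (size_t)MINSIGSTKSZ;

	return from_kernel > legacy ? (size_t)from_kernel : legacy;
}

/*
 * Install an alternate signal stack with 'headroom' bytes on top of the
 * minimum for the handler's own use. Returns 0 on success, -1 on error.
 * The stack memory stays allocated for as long as it is installed.
 */
static int install_sigaltstack(size_t headroom)
{
	stack_t ss;

	ss.ss_size = min_sigstack_size() + headroom;
	ss.ss_sp = malloc(ss.ss_size);
	if (!ss.ss_sp)
		return -1;
	ss.ss_flags = 0;

	if (sigaltstack(&ss, NULL)) {
		free(ss.ss_sp);
		return -1;
	}
	return 0;
}
```

Note that AT_MINSIGSTKSZ is read once at this point; whether it already
covers a dynamically enabled feature requested later is not something
this sketch establishes.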

		Permission, when granted, is valid per process. Permissions are inherited
		on fork(2) and cleared on exec(3).

		The first use of an instruction related to a dynamically enabled feature is
		trapped by the kernel. The trap handler checks whether the process has
		permission to use the feature. If the process has no permission then the
		kernel sends SIGILL to the application. If the process has permission then
		the handler allocates a larger xstate buffer for the task so the large
		state can be context switched. In the unlikely case that the allocation
		fails, the kernel sends SIGSEGV.
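
The decision sequence of the trap handler can be modelled as in this
sketch. It is a user space simulation with hypothetical names (struct
task_sketch, handle_first_use(), failing_alloc()); the allocator is passed
in so the failure path can be exercised, and copying of existing register
state into the new buffer is not modelled.

```c
#include <stdint.h>
#include <stdlib.h>

/* Possible outcomes of the simulated first-use trap. */
enum nm_result { NM_OK, NM_SIGILL, NM_SIGSEGV };

/* Hypothetical per-task state; not the kernel's data structures. */
struct task_sketch {
	uint64_t permitted;  /* features the process has permission for */
	uint64_t xfd;        /* features whose first-use trap is still armed */
	size_t   buf_size;   /* size of the current xstate buffer */
	void    *buf;        /* current xstate buffer, NULL if default */
};

/* Allocator that always fails, to exercise the SIGSEGV path. */
static void *failing_alloc(size_t size)
{
	(void)size;
	return NULL;
}

static enum nm_result handle_first_use(struct task_sketch *t,
				       uint64_t feature, size_t needed,
				       void *(*alloc)(size_t))
{
	void *newbuf;

	/* No permission: treated like an unavailable instruction. */
	if (!(t->permitted & feature))
		return NM_SIGILL;

	/* Permission granted: try to get the larger buffer. */
	newbuf = alloc(needed);
	if (!newbuf)
		return NM_SIGSEGV;

	/* Switch to the new buffer and disarm the trap for this task. */
	free(t->buf);
	t->buf = newbuf;
	t->buf_size = needed;
	t->xfd &= ~feature;
	return NM_OK;
}
```

On the failure paths the task state is left untouched in this model, so
the trap stays armed; only a successful allocation clears the bit.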

arch/Kconfig

+3 −0

		@@ -1288,6 +1288,9 @@ config ARCH_HAS_ELFCORE_COMPAT
		config ARCH_HAS_PARANOID_L1D_FLUSH
		bool

		config DYNAMIC_SIGFRAME
		bool

		source "kernel/gcov/Kconfig"

		source "scripts/gcc-plugins/Kconfig"

arch/x86/Kconfig

+17 −0

		@@ -125,6 +125,7 @@ config X86
		select CLOCKSOURCE_VALIDATE_LAST_CYCLE
		select CLOCKSOURCE_WATCHDOG
		select DCACHE_WORD_ACCESS
		select DYNAMIC_SIGFRAME
		select EDAC_ATOMIC_SCRUB
		select EDAC_SUPPORT
		select GENERIC_CLOCKEVENTS_BROADCAST if X86_64 || (X86_32 && X86_LOCAL_APIC)
		@@ -2399,6 +2400,22 @@ config MODIFY_LDT_SYSCALL

		Saying 'N' here may make sense for embedded or server kernels.

		config STRICT_SIGALTSTACK_SIZE
		bool "Enforce strict size checking for sigaltstack"
		depends on DYNAMIC_SIGFRAME
		help
		For historical reasons MINSIGSTKSZ is a constant which already
		became too small with AVX512 support. Add a mechanism to
		enforce strict checking of the sigaltstack size against the
		real size of the FPU frame. This option enables the check
		by default. It can also be controlled via the kernel command
		line option 'strict_sas_size' independent of this config
		switch. Enabling it might break existing applications which
		allocate a too small sigaltstack but 'work' because they
		never get a signal delivered.

		Say 'N' unless you want to really enforce this check.

		source "kernel/livepatch/Kconfig"

		endmenu