Commit 1ac0884d authored by Linus Torvalds

Merge tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core entry/exit updates from Thomas Gleixner:
 "A set of updates for entry/exit handling:

   - More generalization of entry/exit functionality

   - The consolidation work to reclaim TIF flags on x86 and also for
     non-x86 specific TIF flags which are solely relevant for syscall
     related work and have been moved into their own storage space. The
     x86 specific part had to be merged in to avoid a major conflict.

   - The TIF_NOTIFY_SIGNAL work which replaces the inefficient signal
     delivery mode of task work and results in an impressive performance
     improvement for io_uring. The non-x86 consolidation of this is
     going to come separately via Jens.

   - The selective syscall redirection facility which provides a clean
     and efficient way to support the non-Linux syscalls of WINE by
     catching them at syscall entry and redirecting them to the user
     space emulation. This can be utilized for other purposes as well
     and has been designed carefully to avoid overhead for the regular
     fastpath. This includes the core changes and the x86 support code.

   - Simplification of the context tracking entry/exit handling for the
     users of the generic entry code which guarantee the proper ordering
     and protection.

   - Preparatory changes to make the generic entry code accommodate S390
     specific requirements which are mostly related to their syscall
     restart mechanism"

* tag 'core-entry-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
  entry: Add syscall_exit_to_user_mode_work()
  entry: Add exit_to_user_mode() wrapper
  entry_Add_enter_from_user_mode_wrapper
  entry: Rename exit_to_user_mode()
  entry: Rename enter_from_user_mode()
  docs: Document Syscall User Dispatch
  selftests: Add benchmark for syscall user dispatch
  selftests: Add kselftest for syscall user dispatch
  entry: Support Syscall User Dispatch on common syscall entry
  kernel: Implement selective syscall userspace redirection
  signal: Expose SYS_USER_DISPATCH si_code type
  x86: vdso: Expose sigreturn address on vdso to the kernel
  MAINTAINERS: Add entry for common entry code
  entry: Fix boot for !CONFIG_GENERIC_ENTRY
  x86: Support HAVE_CONTEXT_TRACKING_OFFSTACK
  context_tracking: Only define schedule_user() on !HAVE_CONTEXT_TRACKING_OFFSTACK archs
  sched: Detect call to schedule from critical entry code
  context_tracking: Don't implement exception_enter/exit() on CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK
  context_tracking: Introduce HAVE_CONTEXT_TRACKING_OFFSTACK
  x86: Reclaim unused x86 TI flags
  ...
parents ff613595 c6156e1d
+1 −0
@@ -113,6 +113,7 @@ configure specific aspects of kernel behavior to your liking.
   rtc
   serial-console
   svga
   syscall-user-dispatch
   sysrq
   thunderbolt
   ufs
+90 −0
.. SPDX-License-Identifier: GPL-2.0

=====================
Syscall User Dispatch
=====================

Background
----------

Compatibility layers like Wine need a way to efficiently emulate system
calls of only a part of their process - the part that has the
incompatible code - while being able to execute native syscalls without
a high performance penalty on the native part of the process.  Seccomp
falls short on this task, since it has limited support to efficiently
filter syscalls based on memory regions, and it doesn't support removing
filters.  Therefore a new mechanism is necessary.

Syscall User Dispatch brings the filtering of the syscall dispatcher
address back to userspace.  The application is in control of a flip
switch, indicating the current personality of the process.  A
multiple-personality application can then flip the switch without
invoking the kernel, when crossing the compatibility layer API
boundaries, to enable/disable the syscall redirection and execute
syscalls directly (disabled) or send them to be emulated in userspace
through a SIGSYS.
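
As a minimal sketch of the receiving side, assuming the emulation layer installs an ordinary SA_SIGINFO handler for SIGSYS, the handler can tell dispatched syscalls apart from other SIGSYS sources by checking si_code against SYS_USER_DISPATCH.  The names on_sigsys, install_sigsys_handler and the two counters are hypothetical, and a real handler would decode the foreign syscall from the saved register context rather than just count:

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Fallback for C libraries that do not expose the constant yet; 2 is the
 * si_code value used by the kernel's uapi siginfo.h. */
#ifndef SYS_USER_DISPATCH
#define SYS_USER_DISPATCH 2
#endif

/* Hypothetical counters, only here to make the handler observable. */
static volatile sig_atomic_t sigsys_total;
static volatile sig_atomic_t sigsys_dispatched;

static void on_sigsys(int signo, siginfo_t *info, void *ucontext)
{
    (void)signo;
    (void)ucontext; /* an emulator would read the syscall registers here */

    sigsys_total++;
    if (info->si_code == SYS_USER_DISPATCH)
        sigsys_dispatched++; /* SIGSYS raised by Syscall User Dispatch */
}

/* Returns 0 on success, -1 with errno set on failure. */
static int install_sigsys_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigsys;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSYS, &sa, NULL);
}
```

A SIGSYS sent by other means, for example with raise(), reaches the same handler but carries a different si_code, so it is counted only in sigsys_total.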

The goal of this design is to provide very quick compatibility layer
boundary crossings, which is achieved by not executing a syscall to change
personality every time the compatibility layer executes.  Instead, a
userspace memory region exposed to the kernel indicates the current
personality, and the application simply modifies that variable to
configure the mechanism.
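
The flip switch can be pictured with the following userspace-only sketch.  It assumes the address of a per-thread byte has already been registered with the kernel, and every name in it (sud_selector, enter_foreign_code, leave_foreign_code, call_foreign, observe_selector) is hypothetical.  The point is only that changing personality is a plain memory store, with no syscall involved:

```c
/* Selector values as used in this document; the fallbacks cover builds
 * where <sys/prctl.h> is not included or predates the feature. */
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_ON 1
#endif

/* Hypothetical per-thread selector byte that the kernel would read. */
static _Thread_local volatile unsigned char sud_selector = PR_SYS_DISPATCH_OFF;

/* Crossing the compatibility layer boundary is a single store. */
static inline void enter_foreign_code(void) { sud_selector = PR_SYS_DISPATCH_ON; }
static inline void leave_foreign_code(void) { sud_selector = PR_SYS_DISPATCH_OFF; }

typedef int (*foreign_fn)(int);

/* Run non-native code with redirection on, then restore native mode. */
static int call_foreign(foreign_fn fn, int arg)
{
    enter_foreign_code();
    int ret = fn(arg);
    leave_foreign_code();
    return ret;
}

/* Stand-in for non-native code: reports the selector it ran under. */
static int observe_selector(int unused)
{
    (void)unused;
    return sud_selector;
}
```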

There is a relatively high cost associated with handling signals on most
architectures, like x86, but at least for Wine, syscalls issued by
native Windows code are currently not known to be a performance problem,
since they are quite rare, at least for modern gaming applications.

Since this mechanism is designed to capture syscalls issued by
non-native applications, it must function on syscalls whose invocation
ABI is completely unexpected to Linux.  Syscall User Dispatch therefore
doesn't rely on any of the syscall ABI to do the filtering.  It uses
only the syscall dispatcher address and the userspace key.

As the ABI of these intercepted syscalls is unknown to Linux, these
syscalls are not instrumentable via ptrace or the syscall tracepoints.

Interface
---------

A thread can set up this mechanism on supported kernels by executing the
following prctl:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector])

<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and
disable the mechanism globally for that thread.  When
PR_SYS_DISPATCH_OFF is used, the other fields must be zero.

[<offset>, <offset>+<length>) delimit a memory region interval
from which syscalls are always executed directly, regardless of the
userspace selector.  This provides a fast path for the C library, which
includes the most common syscall dispatchers in the native code
applications, and also provides a way for the signal handler to return
without triggering a nested SIGSYS on (rt\_)sigreturn.  Users of this
interface should make sure that at least the signal trampoline code is
included in this region. In addition, for architectures that implement the
trampoline code on the vDSO, that trampoline is never intercepted.
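
The half-open interval test can be modelled in a few lines.  This is an illustrative userspace model, not the kernel's source, and in_allowed_region is a hypothetical name:

```c
#include <stdbool.h>
#include <stdint.h>

/* True when ip lies in the half-open interval [offset, offset + length). */
static bool in_allowed_region(uintptr_t ip, uintptr_t offset, uintptr_t length)
{
    /* Compare via subtraction so that offset + length cannot overflow. */
    return ip >= offset && ip - offset < length;
}
```

With offset 0x1000 and length 0x100, the addresses 0x1000 through 0x10ff fall inside the region, while 0x0fff and 0x1100 fall outside it.  A length of zero yields an empty region.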

[selector] is a pointer to a char-sized region in the process memory
that provides a quick way to enable or disable syscall redirection
thread-wide, without the need to invoke the kernel directly.  selector
can be set to PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF.  Any other
value should terminate the program with a SIGSYS.
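
Putting the arguments together, a thread might enable and later disable the mechanism roughly as follows.  This is a hedged sketch: sud_selector, sud_enable and sud_disable are hypothetical names, the fallback constants are only for headers that predate the feature, and the selector is deliberately left at PR_SYS_DISPATCH_OFF so that no syscall is redirected until the application flips it:

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for older headers; values match the kernel's uapi prctl.h. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_ON 1
#endif

/* Hypothetical selector byte; it must stay valid while dispatch is on. */
static volatile char sud_selector = PR_SYS_DISPATCH_OFF;

/*
 * Enable the mechanism for the calling thread.  Syscalls issued from
 * [offset, offset + length) always run directly.  Returns 0 on success,
 * or -1 with errno set, e.g. on kernels without support.
 */
static int sud_enable(unsigned long offset, unsigned long length)
{
    sud_selector = PR_SYS_DISPATCH_OFF; /* start in native mode */
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
                 offset, length, (unsigned long)&sud_selector);
}

/* Disable the mechanism; all other fields must be zero. */
static int sud_disable(void)
{
    return prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF,
                 0UL, 0UL, 0UL);
}
```

A caller would normally install its SIGSYS handler before setting the selector to PR_SYS_DISPATCH_ON, and pass a region that covers at least the signal trampoline.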

Security Notes
--------------

Syscall User Dispatch provides functionality for compatibility layers to
quickly capture system calls issued by a non-native part of the
application, while not impacting the Linux native regions of the
process.  It is not a mechanism for sandboxing system calls, and it
should not be seen as a security mechanism, since it is trivial for a
malicious application to subvert the mechanism by jumping to an allowed
dispatcher region prior to executing the syscall, or to discover the
address and modify the selector value.  If the use case requires any
kind of security sandboxing, Seccomp should be used instead.

Any fork or exec of the existing process resets the mechanism to
PR_SYS_DISPATCH_OFF.
+11 −0
@@ -7361,6 +7361,17 @@ S: Maintained
F:	drivers/base/arch_topology.c
F:	include/linux/arch_topology.h
GENERIC ENTRY CODE
M:	Thomas Gleixner <tglx@linutronix.de>
M:	Peter Zijlstra <peterz@infradead.org>
M:	Andy Lutomirski <luto@kernel.org>
L:	linux-kernel@vger.kernel.org
S:	Maintained
T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git core/entry
F:	include/linux/entry-common.h
F:	include/linux/entry-kvm.h
F:	kernel/entry/
GENERIC GPIO I2C DRIVER
M:	Wolfram Sang <wsa+renesas@sang-engineering.com>
S:	Supported
+17 −0
@@ -618,6 +618,23 @@ config HAVE_CONTEXT_TRACKING
	  protected inside rcu_irq_enter/rcu_irq_exit() but preemption or signal
	  handling on irq exit still need to be protected.

config HAVE_CONTEXT_TRACKING_OFFSTACK
	bool
	help
	  Architecture neither relies on exception_enter()/exception_exit()
	  nor on schedule_user(). Also preempt_schedule_notrace() and
	  preempt_schedule_irq() can't be called in a preemptible section
	  while context tracking is CONTEXT_USER. This feature reflects a sane
	  entry implementation where the following requirements are met on
	  critical entry code, i.e. before user_exit() or after user_enter():

	  - Critical entry code isn't preemptible (or better yet:
	    not interruptible).
	  - No use of RCU read side critical sections, unless rcu_nmi_enter()
	    got called.
	  - No use of instrumentation, unless instrumentation_begin() got
	    called.

config HAVE_TIF_NOHZ
	bool
	help
+1 −0
@@ -163,6 +163,7 @@ config X86
	select HAVE_CMPXCHG_DOUBLE
	select HAVE_CMPXCHG_LOCAL
	select HAVE_CONTEXT_TRACKING		if X86_64
	select HAVE_CONTEXT_TRACKING_OFFSTACK	if HAVE_CONTEXT_TRACKING
	select HAVE_C_RECORDMCOUNT
	select HAVE_DEBUG_KMEMLEAK
	select HAVE_DMA_CONTIGUOUS