Skip to content
  1. May 07, 2019
    • Christian Brauner's avatar
      clone: add CLONE_PIDFD · b3e58382
      Christian Brauner authored
      
      
      This patchset makes it possible to retrieve pid file descriptors at
      process creation time by introducing the new flag CLONE_PIDFD to the
      clone() system call.  Linus originally suggested to implement this as a
      new flag to clone() instead of making it a separate system call.  As
      spotted by Linus, there is exactly one bit for clone() left.
      
      CLONE_PIDFD creates file descriptors based on the anonymous inode
      implementation in the kernel that will also be used to implement the new
      mount api.  They serve as a simple opaque handle on pids.  Logically,
      this makes it possible to interpret a pidfd differently, narrowing or
      widening the scope of various operations (e.g. signal sending).  Thus, a
      pidfd cannot just refer to a tgid, but also a tid, or in theory - given
      appropriate flag arguments in relevant syscalls - a process group or
      session. A pidfd does not represent a privilege.  This does not imply it
      cannot ever be that way but for now this is not the case.
      
      A pidfd comes with additional information in fdinfo if the kernel supports
      procfs.  The fdinfo file contains the pid of the process in the callers
      pid namespace in the same format as the procfs status file, i.e. "Pid:\t%d".
      
      As suggested by Oleg, with CLONE_PIDFD the pidfd is returned in the
      parent_tidptr argument of clone.  This has the advantage that we can
      give back the associated pid and the pidfd at the same time.
      
      To remove worries about missing metadata access this patchset comes with
      a sample program that illustrates how a combination of CLONE_PIDFD, and
      pidfd_send_signal() can be used to gain race-free access to process
      metadata through /proc/<pid>.  The sample program can easily be
      translated into a helper that would be suitable for inclusion in libc so
      that users don't have to worry about writing it themselves.
      
      Suggested-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: default avatarChristian Brauner <christian@brauner.io>
      Co-developed-by: default avatarJann Horn <jannh@google.com>
      Signed-off-by: default avatarJann Horn <jannh@google.com>
      Reviewed-by: default avatarOleg Nesterov <oleg@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      b3e58382
  2. Apr 19, 2019
  3. Apr 08, 2019
  4. Apr 07, 2019
    • Linus Torvalds's avatar
      Merge branch 'i2c/for-current-fixed' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · faac51dd
      Linus Torvalds authored
      Pull i2c fix from Wolfram Sang:
       "A simple but wanted driver bugfix"
      
      * 'i2c/for-current-fixed' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
        i2c: imx: don't leak the i2c adapter on error
      faac51dd
    • Linus Torvalds's avatar
      Merge branch 'parisc-5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux · 373c3925
      Linus Torvalds authored
      Pull parisc fixes from Helge Deller:
       "A 32-bit boot regression fix introduced in the merge window, a QEMU
        detection fix and two fixes by Sven regarding ptrace & kprobes"
      
      * 'parisc-5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
        parisc: Detect QEMU earlier in boot process
        parisc: also set iaoq_b in instruction_pointer_set()
        parisc: regs_return_value() should return gpr28
        Revert: parisc: Use F_EXTEND() macro in iosapic code
      373c3925
    • Helge Deller's avatar
      parisc: Detect QEMU earlier in boot process · d006e95b
      Helge Deller authored
      While adding LASI support to QEMU, I noticed that the QEMU detection in
      the kernel happens much too late. For example, when a LASI chip is found
      by the kernel, it registers the LASI LED driver as well.  But when we
      run on QEMU it makes sense to avoid spending unnecessary CPU cycles, so
      we need to access the running_on_QEMU flag earlier than before.
      
      This patch now makes the QEMU detection the fist task of the Linux
      kernel by moving it to where the kernel enters the C-coding.
      
      Fixes: 310d8278
      
       ("parisc: qemu idle sleep support")
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Cc: stable@vger.kernel.org # v4.14+
      d006e95b
    • Sven Schnelle's avatar
      parisc: also set iaoq_b in instruction_pointer_set() · f324fa58
      Sven Schnelle authored
      
      
      When setting the instruction pointer on PA-RISC we also need
      to set the back of the instruction queue to the new offset, otherwise
      we will execute on instruction from the new location, and jumping
      back to the old location stored in iaoq_b.
      
      Signed-off-by: default avatarSven Schnelle <svens@stackframe.org>
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Fixes: 75ebedf1 ("parisc: Add HAVE_REGS_AND_STACK_ACCESS_API feature")
      Cc: stable@vger.kernel.org # 4.19+
      f324fa58
    • Sven Schnelle's avatar
      parisc: regs_return_value() should return gpr28 · 45efd871
      Sven Schnelle authored
      
      
      While working on kretprobes for PA-RISC I was wondering while the
      kprobes sanity test always fails on kretprobes. This is caused by
      returning gpr20 instead of gpr28.
      
      Signed-off-by: default avatarSven Schnelle <svens@stackframe.org>
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      Cc: stable@vger.kernel.org # 4.14+
      45efd871
    • Helge Deller's avatar
      Revert: parisc: Use F_EXTEND() macro in iosapic code · c2f8d7cb
      Helge Deller authored
      Revert parts of commit 97d7e2e3
      
       ("parisc: Use F_EXTEND() macro in
      iosapic code"). It breaks booting the 32-bit kernel on some machines.
      
      Reported-by: default avatarSven Schnelle <svens@stackframe.org>
      Tested-by: default avatarSven Schnelle <svens@stackframe.org>
      Fixes: 97d7e2e3
      
       ("parisc: Use F_EXTEND() macro in iosapic code")
      Signed-off-by: default avatarHelge Deller <deller@gmx.de>
      c2f8d7cb
    • Kirill Smelkov's avatar
      fs: stream_open - opener for stream-like files so that read and write can run... · 10dce8af
      Kirill Smelkov authored
      fs: stream_open - opener for stream-like files so that read and write can run simultaneously without deadlock
      
      Commit 9c225f26 ("vfs: atomic f_pos accesses as per POSIX") added
      locking for file.f_pos access and in particular made concurrent read and
      write not possible - now both those functions take f_pos lock for the
      whole run, and so if e.g. a read is blocked waiting for data, write will
      deadlock waiting for that read to complete.
      
      This caused regression for stream-like files where previously read and
      write could run simultaneously, but after that patch could not do so
      anymore. See e.g. commit 581d21a2 ("xenbus: fix deadlock on writes
      to /proc/xen/xenbus") which fixes such regression for particular case of
      /proc/xen/xenbus.
      
      The patch that added f_pos lock in 2014 did so to guarantee POSIX thread
      safety for read/write/lseek and added the locking to file descriptors of
      all regular files. In 2014 that thread-safety problem was not new as it
      was already discussed earlier in 2006.
      
      However even though 2006'th version of Linus's patch was adding f_pos
      locking "only for files that are marked seekable with FMODE_LSEEK (thus
      avoiding the stream-like objects like pipes and sockets)", the 2014
      version - the one that actually made it into the tree as 9c225f26 -
      is doing so irregardless of whether a file is seekable or not.
      
      See
      
          https://lore.kernel.org/lkml/53022DB1.4070805@gmail.com/
          https://lwn.net/Articles/180387
          https://lwn.net/Articles/180396
      
      for historic context.
      
      The reason that it did so is, probably, that there are many files that
      are marked non-seekable, but e.g. their read implementation actually
      depends on knowing current position to correctly handle the read. Some
      examples:
      
      	kernel/power/user.c		snapshot_read
      	fs/debugfs/file.c		u32_array_read
      	fs/fuse/control.c		fuse_conn_waiting_read + ...
      	drivers/hwmon/asus_atk0110.c	atk_debugfs_ggrp_read
      	arch/s390/hypfs/inode.c		hypfs_read_iter
      	...
      
      Despite that, many nonseekable_open users implement read and write with
      pure stream semantics - they don't depend on passed ppos at all. And for
      those cases where read could wait for something inside, it creates a
      situation similar to xenbus - the write could be never made to go until
      read is done, and read is waiting for some, potentially external, event,
      for potentially unbounded time -> deadlock.
      
      Besides xenbus, there are 14 such places in the kernel that I've found
      with semantic patch (see below):
      
      	drivers/xen/evtchn.c:667:8-24: ERROR: evtchn_fops: .read() can deadlock .write()
      	drivers/isdn/capi/capi.c:963:8-24: ERROR: capi_fops: .read() can deadlock .write()
      	drivers/input/evdev.c:527:1-17: ERROR: evdev_fops: .read() can deadlock .write()
      	drivers/char/pcmcia/cm4000_cs.c:1685:7-23: ERROR: cm4000_fops: .read() can deadlock .write()
      	net/rfkill/core.c:1146:8-24: ERROR: rfkill_fops: .read() can deadlock .write()
      	drivers/s390/char/fs3270.c:488:1-17: ERROR: fs3270_fops: .read() can deadlock .write()
      	drivers/usb/misc/ldusb.c:310:1-17: ERROR: ld_usb_fops: .read() can deadlock .write()
      	drivers/hid/uhid.c:635:1-17: ERROR: uhid_fops: .read() can deadlock .write()
      	net/batman-adv/icmp_socket.c:80:1-17: ERROR: batadv_fops: .read() can deadlock .write()
      	drivers/media/rc/lirc_dev.c:198:1-17: ERROR: lirc_fops: .read() can deadlock .write()
      	drivers/leds/uleds.c:77:1-17: ERROR: uleds_fops: .read() can deadlock .write()
      	drivers/input/misc/uinput.c:400:1-17: ERROR: uinput_fops: .read() can deadlock .write()
      	drivers/infiniband/core/user_mad.c:985:7-23: ERROR: umad_fops: .read() can deadlock .write()
      	drivers/gnss/core.c:45:1-17: ERROR: gnss_fops: .read() can deadlock .write()
      
      In addition to the cases above another regression caused by f_pos
      locking is that now FUSE filesystems that implement open with
      FOPEN_NONSEEKABLE flag, can no longer implement bidirectional
      stream-like files - for the same reason as above e.g. read can deadlock
      write locking on file.f_pos in the kernel.
      
      FUSE's FOPEN_NONSEEKABLE was added in 2008 in a7c1b990 ("fuse:
      implement nonseekable open") to support OSSPD. OSSPD implements /dev/dsp
      in userspace with FOPEN_NONSEEKABLE flag, with corresponding read and
      write routines not depending on current position at all, and with both
      read and write being potentially blocking operations:
      
      See
      
          https://github.com/libfuse/osspd
          https://lwn.net/Articles/308445
      
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1406
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1438-L1477
          https://github.com/libfuse/osspd/blob/14a9cff0/osspd.c#L1479-L1510
      
      Corresponding libfuse example/test also describes FOPEN_NONSEEKABLE as
      "somewhat pipe-like files ..." with read handler not using offset.
      However that test implements only read without write and cannot exercise
      the deadlock scenario:
      
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L124-L131
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L146-L163
          https://github.com/libfuse/libfuse/blob/fuse-3.4.2-3-ga1bff7d/example/poll.c#L209-L216
      
      I've actually hit the read vs write deadlock for real while implementing
      my FUSE filesystem where there is /head/watch file, for which open
      creates separate bidirectional socket-like stream in between filesystem
      and its user with both read and write being later performed
      simultaneously. And there it is semantically not easy to split the
      stream into two separate read-only and write-only channels:
      
          https://lab.nexedi.com/kirr/wendelin.core/blob/f13aa600/wcfs/wcfs.go#L88-169
      
      Let's fix this regression. The plan is:
      
      1. We can't change nonseekable_open to include &~FMODE_ATOMIC_POS -
         doing so would break many in-kernel nonseekable_open users which
         actually use ppos in read/write handlers.
      
      2. Add stream_open() to kernel to open stream-like non-seekable file
         descriptors. Read and write on such file descriptors would never use
         nor change ppos. And with that property on stream-like files read and
         write will be running without taking f_pos lock - i.e. read and write
         could be running simultaneously.
      
      3. With semantic patch search and convert to stream_open all in-kernel
         nonseekable_open users for which read and write actually do not
         depend on ppos and where there is no other methods in file_operations
         which assume @offset access.
      
      4. Add FOPEN_STREAM to fs/fuse/ and open in-kernel file-descriptors via
         steam_open if that bit is present in filesystem open reply.
      
         It was tempting to change fs/fuse/ open handler to use stream_open
         instead of nonseekable_open on just FOPEN_NONSEEKABLE flags, but
         grepping through Debian codesearch shows users of FOPEN_NONSEEKABLE,
         and in particular GVFS which actually uses offset in its read and
         write handlers
      
      	https://codesearch.debian.net/search?q=-%3Enonseekable+%3D
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1080
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1247-1346
      	https://gitlab.gnome.org/GNOME/gvfs/blob/1.40.0-6-gcbc54396/client/gvfsfusedaemon.c#L1399-1481
      
         so if we would do such a change it will break a real user.
      
      5. Add stream_open and FOPEN_STREAM handling to stable kernels starting
         from v3.14+ (the kernel where 9c225f26
      
       first appeared).
      
         This will allow to patch OSSPD and other FUSE filesystems that
         provide stream-like files to return FOPEN_STREAM | FOPEN_NONSEEKABLE
         in their open handler and this way avoid the deadlock on all kernel
         versions. This should work because fs/fuse/ ignores unknown open
         flags returned from a filesystem and so passing FOPEN_STREAM to a
         kernel that is not aware of this flag cannot hurt. In turn the kernel
         that is not aware of FOPEN_STREAM will be < v3.14 where just
         FOPEN_NONSEEKABLE is sufficient to implement streams without read vs
         write deadlock.
      
      This patch adds stream_open, converts /proc/xen/xenbus to it and adds
      semantic patch to automatically locate in-kernel places that are either
      required to be converted due to read vs write deadlock, or that are just
      safe to be converted because read and write do not use ppos and there
      are no other funky methods in file_operations.
      
      Regarding semantic patch I've verified each generated change manually -
      that it is correct to convert - and each other nonseekable_open instance
      left - that it is either not correct to convert there, or that it is not
      converted due to current stream_open.cocci limitations.
      
      The script also does not convert files that should be valid to convert,
      but that currently have .llseek = noop_llseek or generic_file_llseek for
      unknown reason despite file being opened with nonseekable_open (e.g.
      drivers/input/mousedev.c)
      
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Yongzhi Pan <panyongzhi@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Nikolaus Rath <Nikolaus@rath.org>
      Cc: Han-Wen Nienhuys <hanwen@google.com>
      Signed-off-by: default avatarKirill Smelkov <kirr@nexedi.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      10dce8af
    • Guenter Roeck's avatar
      xsysace: Fix error handling in ace_setup · 47b16820
      Guenter Roeck authored
      If xace hardware reports a bad version number, the error handling code
      in ace_setup() calls put_disk(), followed by queue cleanup. However, since
      the disk data structure has the queue pointer set, put_disk() also
      cleans and releases the queue. This results in blk_cleanup_queue()
      accessing an already released data structure, which in turn may result
      in a crash such as the following.
      
      [   10.681671] BUG: Kernel NULL pointer dereference at 0x00000040
      [   10.681826] Faulting instruction address: 0xc0431480
      [   10.682072] Oops: Kernel access of bad area, sig: 11 [#1]
      [   10.682251] BE PAGE_SIZE=4K PREEMPT Xilinx Virtex440
      [   10.682387] Modules linked in:
      [   10.682528] CPU: 0 PID: 1 Comm: swapper Tainted: G        W         5.0.0-rc6-next-20190218+ #2
      [   10.682733] NIP:  c0431480 LR: c043147c CTR: c0422ad8
      [   10.682863] REGS: cf82fbe0 TRAP: 0300   Tainted: G        W          (5.0.0-rc6-next-20190218+)
      [   10.683065] MSR:  00029000 <CE,EE,ME>  CR: 22000222  XER: 00000000
      [   10.683236] DEAR: 00000040 ESR: 00000000
      [   10.683236] GPR00: c043147c cf82fc90 cf82ccc0 00000000 00000000 00000000 00000002 00000000
      [   10.683236] GPR08: 00000000 00000000 c04310bc 00000000 22000222 00000000 c0002c54 00000000
      [   10.683236] GPR16: 00000000 00000001 c09aa39c c09021b0 c09021dc 00000007 c0a68c08 00000000
      [   10.683236] GPR24: 00000001 ced6d400 ced6dcf0 c0815d9c 00000000 00000000 00000000 cedf0800
      [   10.684331] NIP [c0431480] blk_mq_run_hw_queue+0x28/0x114
      [   10.684473] LR [c043147c] blk_mq_run_hw_queue+0x24/0x114
      [   10.684602] Call Trace:
      [   10.684671] [cf82fc90] [c043147c] blk_mq_run_hw_queue+0x24/0x114 (unreliable)
      [   10.684854] [cf82fcc0] [c04315bc] blk_mq_run_hw_queues+0x50/0x7c
      [   10.685002] [cf82fce0] [c0422b24] blk_set_queue_dying+0x30/0x68
      [   10.685154] [cf82fcf0] [c0423ec0] blk_cleanup_queue+0x34/0x14c
      [   10.685306] [cf82fd10] [c054d73c] ace_probe+0x3dc/0x508
      [   10.685445] [cf82fd50] [c052d740] platform_drv_probe+0x4c/0xb8
      [   10.685592] [cf82fd70] [c052abb0] really_probe+0x20c/0x32c
      [   10.685728] [cf82fda0] [c052ae58] driver_probe_device+0x68/0x464
      [   10.685877] [cf82fdc0] [c052b500] device_driver_attach+0xb4/0xe4
      [   10.686024] [cf82fde0] [c052b5dc] __driver_attach+0xac/0xfc
      [   10.686161] [cf82fe00] [c0528428] bus_for_each_dev+0x80/0xc0
      [   10.686314] [cf82fe30] [c0529b3c] bus_add_driver+0x144/0x234
      [   10.686457] [cf82fe50] [c052c46c] driver_register+0x88/0x15c
      [   10.686610] [cf82fe60] [c09de288] ace_init+0x4c/0xac
      [   10.686742] [cf82fe80] [c0002730] do_one_initcall+0xac/0x330
      [   10.686888] [cf82fee0] [c09aafd0] kernel_init_freeable+0x34c/0x478
      [   10.687043] [cf82ff30] [c0002c6c] kernel_init+0x18/0x114
      [   10.687188] [cf82ff40] [c000f2f0] ret_from_kernel_thread+0x14/0x1c
      [   10.687349] Instruction dump:
      [   10.687435] 3863ffd4 4bfffd70 9421ffd0 7c0802a6 93c10028 7c9e2378 93e1002c 38810008
      [   10.687637] 7c7f1b78 90010034 4bfffc25 813f008c <81290040> 75290100 4182002c 80810008
      [   10.688056] ---[ end trace 13c9ff51d41b9d40 ]---
      
      Fix the problem by setting the disk queue pointer to NULL before calling
      put_disk(). A more comprehensive fix might be to rearrange the code
      to check the hardware version before initializing data structures,
      but I don't know if this would have undesirable side effects, and
      it would increase the complexity of backporting the fix to older kernels.
      
      Fixes: 74489a91
      
       ("Add support for Xilinx SystemACE CompactFlash interface")
      Acked-by: default avatarMichal Simek <michal.simek@xilinx.com>
      Signed-off-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      47b16820
    • John Pittman's avatar
      null_blk: prevent crash from bad home_node value · 7ff684a6
      John Pittman authored
      
      
      At module load, if the selected home_node value is greater than
      the available numa nodes, the system will crash in
      __alloc_pages_nodemask() due to a bad paging request.  Prevent this
      user error crash by detecting the bad value, logging an error, and
      setting g_home_node back to the default of NUMA_NO_NODE.
      
      Signed-off-by: default avatarJohn Pittman <jpittman@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7ff684a6
    • Linus Torvalds's avatar
      Merge tag 'rtc-5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux · be76865d
      Linus Torvalds authored
      Pull RTC fixes from Alexandre Belloni:
      
       - Various alarm fixes for da9063, cros-ec and sh
      
       - sd3078 manufacturer name fix as this was introduced this cycle
      
      * tag 'rtc-5.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
        rtc: da9063: set uie_unsupported when relevant
        rtc: sd3078: fix manufacturer name
        rtc: sh: Fix invalid alarm warning for non-enabled alarm
        rtc: cros-ec: Fail suspend/resume if wake IRQ can't be configured
      be76865d
  5. Apr 06, 2019