  1. Aug 21, 2022
    • kprobes: don't call disarm_kprobe() for disabled kprobes · 9c80e799
      Kuniyuki Iwashima authored
      The assumption in __disable_kprobe() is wrong, and it could try to disarm
      an already disarmed kprobe and fire the WARN_ONCE() below. [0]  We can
      easily reproduce this issue.
      
      1. Write 0 to /sys/kernel/debug/kprobes/enabled.
      
        # echo 0 > /sys/kernel/debug/kprobes/enabled
      
      2. Run execsnoop.  At this time, one kprobe is disabled.
      
        # /usr/share/bcc/tools/execsnoop &
        [1] 2460
        PCOMM            PID    PPID   RET ARGS
      
        # cat /sys/kernel/debug/kprobes/list
        ffffffff91345650  r  __x64_sys_execve+0x0    [FTRACE]
        ffffffff91345650  k  __x64_sys_execve+0x0    [DISABLED][FTRACE]
      
      3. Write 1 to /sys/kernel/debug/kprobes/enabled, which changes
         kprobes_all_disarmed to false but does not arm the disabled kprobe.
      
        # echo 1 > /sys/kernel/debug/kprobes/enabled
      
        # cat /sys/kernel/debug/kprobes/list
        ffffffff91345650  r  __x64_sys_execve+0x0    [FTRACE]
        ffffffff91345650  k  __x64_sys_execve+0x0    [DISABLED][FTRACE]
      
      4. Kill execsnoop, when __disable_kprobe() calls disarm_kprobe() for the
         disabled kprobe and hits the WARN_ONCE() in __disarm_kprobe_ftrace().
      
        # fg
        /usr/share/bcc/tools/execsnoop
        ^C
      
      Actually, WARN_ONCE() is fired twice, and __unregister_kprobe_top() misses
      some cleanups and leaves the aggregated kprobe in the hash table.  Then,
      __unregister_trace_kprobe() initialises tk->rp.kp.list and creates an
      infinite loop like this.
      
        aggregated kprobe.list -> kprobe.list -.
                                           ^    |
                                           '.__.'
      
      In this situation, these commands fall into the infinite loop and result
      in RCU stall or soft lockup.
      
        cat /sys/kernel/debug/kprobes/list : show_kprobe_addr() enters into the
                                             infinite loop with RCU.
      
        /usr/share/bcc/tools/execsnoop : warn_kprobe_rereg() holds kprobe_mutex,
                                         and __get_valid_kprobe() is stuck in
                                         the loop.
      
      To avoid the issue, make sure we don't call disarm_kprobe() for disabled
      kprobes.
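
      The guard described above can be modelled in user space.  This is only
      a sketch: struct demo_kprobe, demo_all_disarmed and the demo_*()
      helpers are hypothetical names and are not the kernel's data
      structures or the actual patch.  It only shows why disarming a probe
      that was never armed fails with -ENOENT (the "error -2" in the log)
      and how checking the disabled state first avoids that.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical user-space model; not the kernel's struct kprobe. */
struct demo_kprobe {
	bool disabled;	/* models the [DISABLED] state shown in kprobes/list */
	bool armed;	/* models "the ftrace hook is currently installed" */
};

/* Models kprobes_all_disarmed, toggled via kprobes/enabled. */
static bool demo_all_disarmed;

/* Disarming a probe that was never armed fails, like the -2 in the log. */
static int demo_disarm(struct demo_kprobe *p)
{
	if (!p->armed)
		return -ENOENT;
	p->armed = false;
	return 0;
}

/* Old assumption: "not globally disarmed" implies "this probe is armed". */
static int demo_disable_old(struct demo_kprobe *p)
{
	p->disabled = true;
	return demo_all_disarmed ? 0 : demo_disarm(p);
}

/* Guarded variant: skip the disarm when the probe was already disabled. */
static int demo_disable_guarded(struct demo_kprobe *p)
{
	bool was_disabled = p->disabled;

	p->disabled = true;
	if (demo_all_disarmed || was_disabled)
		return 0;
	return demo_disarm(p);
}

/* Usage: the disabled, never-armed probe from step 4 of the reproducer. */
static void demo_kprobe_example(void)
{
	struct demo_kprobe k = { .disabled = true, .armed = false };

	demo_all_disarmed = false;	/* after "echo 1 > .../enabled" */
	assert(demo_disable_old(&k) == -ENOENT);	/* would hit the WARN */
	assert(demo_disable_guarded(&k) == 0);		/* nothing to disarm */
}
```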
      
      [0]
      Failed to disarm kprobe-ftrace at __x64_sys_execve+0x0/0x40 (error -2)
      WARNING: CPU: 6 PID: 2460 at kernel/kprobes.c:1130 __disarm_kprobe_ftrace.isra.19 (kernel/kprobes.c:1129)
      Modules linked in: ena
      CPU: 6 PID: 2460 Comm: execsnoop Not tainted 5.19.0+ #28
      Hardware name: Amazon EC2 c5.2xlarge/, BIOS 1.0 10/16/2017
      RIP: 0010:__disarm_kprobe_ftrace.isra.19 (kernel/kprobes.c:1129)
      Code: 24 8b 02 eb c1 80 3d c4 83 f2 01 00 75 d4 48 8b 75 00 89 c2 48 c7 c7 90 fa 0f 92 89 04 24 c6 05 ab 83 01 e8 e4 94 f0 ff <0f> 0b 8b 04 24 eb b1 89 c6 48 c7 c7 60 fa 0f 92 89 04 24 e8 cc 94
      RSP: 0018:ffff9e6ec154bd98 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: ffffffff930f7b00 RCX: 0000000000000001
      RDX: 0000000080000001 RSI: ffffffff921461c5 RDI: 00000000ffffffff
      RBP: ffff89c504286da8 R08: 0000000000000000 R09: c0000000fffeffff
      R10: 0000000000000000 R11: ffff9e6ec154bc28 R12: ffff89c502394e40
      R13: ffff89c502394c00 R14: ffff9e6ec154bc00 R15: 0000000000000000
      FS:  00007fe800398740(0000) GS:ffff89c812d80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000c00057f010 CR3: 0000000103b54006 CR4: 00000000007706e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
      <TASK>
       __disable_kprobe (kernel/kprobes.c:1716)
       disable_kprobe (kernel/kprobes.c:2392)
       __disable_trace_kprobe (kernel/trace/trace_kprobe.c:340)
       disable_trace_kprobe (kernel/trace/trace_kprobe.c:429)
       perf_trace_event_unreg.isra.2 (./include/linux/tracepoint.h:93 kernel/trace/trace_event_perf.c:168)
       perf_kprobe_destroy (kernel/trace/trace_event_perf.c:295)
       _free_event (kernel/events/core.c:4971)
       perf_event_release_kernel (kernel/events/core.c:5176)
       perf_release (kernel/events/core.c:5186)
       __fput (fs/file_table.c:321)
       task_work_run (./include/linux/sched.h:2056 (discriminator 1) kernel/task_work.c:179 (discriminator 1))
       exit_to_user_mode_prepare (./include/linux/resume_user_mode.h:49 kernel/entry/common.c:169 kernel/entry/common.c:201)
       syscall_exit_to_user_mode (./arch/x86/include/asm/jump_label.h:55 ./arch/x86/include/asm/nospec-branch.h:384 ./arch/x86/include/asm/entry-common.h:94 kernel/entry/common.c:133 kernel/entry/common.c:296)
       do_syscall_64 (arch/x86/entry/common.c:87)
       entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
      RIP: 0033:0x7fe7ff210654
      Code: 15 79 89 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb be 0f 1f 00 8b 05 9a cd 20 00 48 63 ff 85 c0 75 11 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3a f3 c3 48 83 ec 18 48 89 7c 24 08 e8 34 fc
      RSP: 002b:00007ffdbd1d3538 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
      RAX: 0000000000000000 RBX: 0000000000000008 RCX: 00007fe7ff210654
      RDX: 0000000000000000 RSI: 0000000000002401 RDI: 0000000000000008
      RBP: 0000000000000000 R08: 94ae31d6fda838a4 R09: 00007fe8001c9d30
      R10: 00007ffdbd1d34b0 R11: 0000000000000246 R12: 00007ffdbd1d3600
      R13: 0000000000000000 R14: fffffffffffffffc R15: 00007ffdbd1d3560
      </TASK>
      
      Link: https://lkml.kernel.org/r/20220813020509.90805-1-kuniyu@amazon.com
      Fixes: 69d54b91 ("kprobes: makes kprobes/enabled works correctly for optimized kprobes.")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reported-by: Ayushman Dutta <ayudutta@amazon.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
      Cc: Kuniyuki Iwashima <kuni1840@gmail.com>
      Cc: Ayushman Dutta <ayudutta@amazon.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/shmem: shmem_replace_page() remember NR_SHMEM · 76d36dea
      Hugh Dickins authored
      Elsewhere, NR_SHMEM is updated at the same time as shmem NR_FILE_PAGES;
      but shmem_replace_page() was forgetting to do that - so NR_SHMEM stats
      could grow too big or too small, in those unusual cases when it's used.
      
      Link: https://lkml.kernel.org/r/cec7c09d-5874-e160-ada6-6e10ee48784@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Radoslaw Burny <rburny@google.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/shmem: tmpfs fallocate use file_modified() · 15f242bb
      Hugh Dickins authored
      5.18 fixed the btrfs and ext4 fallocates to use file_modified(), as xfs
      was already doing, to drop privileges: and fstests generic/{683,684,688}
      expect this.  There's no need to argue over keep-size allocation (which
      could just update ctime): fix shmem_fallocate() to behave the same way.
      
      Link: https://lkml.kernel.org/r/39c5e62-4896-7795-c0a0-f79c50d4909@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Radoslaw Burny <rburny@google.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/shmem: fix chattr fsflags support in tmpfs · cb241339
      Hugh Dickins authored
      ext[234] have always allowed unimplemented chattr flags to be set, but
      other filesystems have tended to be stricter.  Follow the stricter
      approach for tmpfs: I don't want to have to explain why csu attributes
      don't actually work, and we won't need to update the chattr(1) manpage;
      and it's never wrong to start off strict, relaxing later if persuaded. 
      Allow only a (append only) i (immutable) A (no atime) and d (no dump).
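
      The "strict" policy above amounts to a mask check.  The following is a
      minimal sketch, not the tmpfs code: the DEMO_* names and
      demo_tmpfs_flags_ok() are hypothetical, and the constants are local
      copies of the FS_*_FL values from <linux/fs.h> so that the example
      builds on its own.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of the FS_*_FL bit values from <linux/fs.h>. */
#define DEMO_FS_SECRM_FL	0x00000001u	/* s: secure deletion */
#define DEMO_FS_UNRM_FL		0x00000002u	/* u: undelete */
#define DEMO_FS_COMPR_FL	0x00000004u	/* c: compress */
#define DEMO_FS_IMMUTABLE_FL	0x00000010u	/* i: immutable */
#define DEMO_FS_APPEND_FL	0x00000020u	/* a: append only */
#define DEMO_FS_NODUMP_FL	0x00000040u	/* d: no dump */
#define DEMO_FS_NOATIME_FL	0x00000080u	/* A: no atime */

/* The four attributes the commit message says tmpfs accepts. */
#define DEMO_TMPFS_ALLOWED \
	(DEMO_FS_APPEND_FL | DEMO_FS_IMMUTABLE_FL | \
	 DEMO_FS_NOATIME_FL | DEMO_FS_NODUMP_FL)

/* Strict check: any bit outside the allowed set rejects the request. */
static bool demo_tmpfs_flags_ok(unsigned int fsflags)
{
	return (fsflags & ~DEMO_TMPFS_ALLOWED) == 0;
}

/* Usage: "chattr +A" would pass, "chattr +c" would be refused. */
static void demo_fsflags_example(void)
{
	assert(demo_tmpfs_flags_ok(DEMO_FS_NOATIME_FL));
	assert(!demo_tmpfs_flags_ok(DEMO_FS_COMPR_FL));
}
```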
      
      Although lsattr showed 'A' inherited, the NOATIME behavior was not being
      inherited: because nothing sync'ed FS_NOATIME_FL to S_NOATIME.  Add
      shmem_set_inode_flags() to sync the flags, using inode_set_flags() to
      avoid that instant of lost immutability during fileattr_set().
      
      But that change switched generic/079 from passing to failing: because
      FS_IMMUTABLE_FL and FS_APPEND_FL had been unconventionally included in the
      INHERITED fsflags: remove them and generic/079 is back to passing.
      
      Link: https://lkml.kernel.org/r/2961dcb0-ddf3-b9f0-3268-12a4ff996856@google.com
      Fixes: e408e695 ("mm/shmem: support FS_IOC_[SG]ETFLAGS in tmpfs")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Radoslaw Burny <rburny@google.com>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb: support write-faults in shared mappings · 1d8d1464
      David Hildenbrand authored
      If we ever get a write-fault on a write-protected page in a shared
      mapping, we'd be in trouble (again).  Instead, we can simply map the page
      writable.
      
      And in fact, there is even a way right now to trigger that code via
      uffd-wp ever since we started to support it for shmem in 5.19:
      
      --------------------------------------------------------------------------
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <fcntl.h>
       #include <unistd.h>
       #include <errno.h>
       #include <sys/mman.h>
       #include <sys/syscall.h>
       #include <sys/ioctl.h>
       #include <linux/userfaultfd.h>
      
       #define HUGETLB_SIZE (2 * 1024 * 1024u)
      
       static char *map;
       int uffd;
      
       static int temp_setup_uffd(void)
       {
       	struct uffdio_api uffdio_api;
       	struct uffdio_register uffdio_register;
       	struct uffdio_writeprotect uffd_writeprotect;
       	struct uffdio_range uffd_range;
      
       	uffd = syscall(__NR_userfaultfd,
       		       O_CLOEXEC | O_NONBLOCK | UFFD_USER_MODE_ONLY);
       	if (uffd < 0) {
       		fprintf(stderr, "syscall() failed: %d\n", errno);
       		return -errno;
       	}
      
       	uffdio_api.api = UFFD_API;
       	uffdio_api.features = UFFD_FEATURE_PAGEFAULT_FLAG_WP;
       	if (ioctl(uffd, UFFDIO_API, &uffdio_api) < 0) {
       		fprintf(stderr, "UFFDIO_API failed: %d\n", errno);
       		return -errno;
       	}
      
       	if (!(uffdio_api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP)) {
       		fprintf(stderr, "UFFD_FEATURE_WRITEPROTECT missing\n");
       		return -ENOSYS;
       	}
      
       	/* Register UFFD-WP */
       	uffdio_register.range.start = (unsigned long) map;
       	uffdio_register.range.len = HUGETLB_SIZE;
       	uffdio_register.mode = UFFDIO_REGISTER_MODE_WP;
       	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) < 0) {
       		fprintf(stderr, "UFFDIO_REGISTER failed: %d\n", errno);
       		return -errno;
       	}
      
       	/* Writeprotect a single page. */
       	uffd_writeprotect.range.start = (unsigned long) map;
       	uffd_writeprotect.range.len = HUGETLB_SIZE;
       	uffd_writeprotect.mode = UFFDIO_WRITEPROTECT_MODE_WP;
       	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &uffd_writeprotect)) {
       		fprintf(stderr, "UFFDIO_WRITEPROTECT failed: %d\n", errno);
       		return -errno;
       	}
      
       	/* Unregister UFFD-WP without prior writeunprotection. */
       	uffd_range.start = (unsigned long) map;
       	uffd_range.len = HUGETLB_SIZE;
       	if (ioctl(uffd, UFFDIO_UNREGISTER, &uffd_range)) {
       		fprintf(stderr, "UFFDIO_UNREGISTER failed: %d\n", errno);
       		return -errno;
       	}
      
       	return 0;
       }
      
       int main(int argc, char **argv)
       {
       	int fd;
      
       	fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT, 0600);
       	if (fd < 0) {
       		fprintf(stderr, "open() failed\n");
       		return -errno;
       	}
       	if (ftruncate(fd, HUGETLB_SIZE)) {
       		fprintf(stderr, "ftruncate() failed\n");
       		return -errno;
       	}
      
       	map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
       	if (map == MAP_FAILED) {
       		fprintf(stderr, "mmap() failed\n");
       		return -errno;
       	}
      
       	*map = 0;
      
       	if (temp_setup_uffd())
       		return 1;
      
       	*map = 0;
      
       	return 0;
       }
      --------------------------------------------------------------------------
      
      Above test fails with SIGBUS when there is only a single free hugetlb page.
       # echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
       # ./test
       Bus error (core dumped)
      
      And worse, with sufficient free hugetlb pages it will map an anonymous page
      into a shared mapping, for example, messing up accounting during unmap
      and breaking MAP_SHARED semantics:
       # echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
       # ./test
       # cat /proc/meminfo | grep HugePages_
       HugePages_Total:       2
       HugePages_Free:        1
       HugePages_Rsvd:    18446744073709551615
       HugePages_Surp:        0
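
      The HugePages_Rsvd value above is 2^64 - 1, which is what an unsigned
      64-bit counter holds after being decremented once from zero.  A small
      sketch of that arithmetic, assuming a 64-bit unsigned counter (the
      demo_* names are hypothetical and are not kernel functions):

```c
#include <assert.h>
#include <stdint.h>

/* Models dropping one reservation from an unsigned 64-bit counter. */
static uint64_t demo_put_reservation(uint64_t resv)
{
	return resv - 1;	/* wraps around when resv is already 0 */
}

/* Usage: one decrement too many yields the value seen in /proc/meminfo. */
static void demo_resv_example(void)
{
	assert(demo_put_reservation(0) == UINT64_MAX);
	assert(UINT64_MAX == 18446744073709551615ULL);
}
```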
      
      Reason is that uffd-wp doesn't clear the uffd-wp PTE bit when
      unregistering and consequently keeps the PTE writeprotected.  Reason for
      this is to avoid the additional overhead when unregistering.  Note that
      this is the case also for !hugetlb and that we will end up with writable
      PTEs that still have the uffd-wp PTE bit set once we return from
      hugetlb_wp().  I'm not touching the uffd-wp PTE bit for now, because it
      seems to be a generic thing -- wp_page_reuse() also doesn't clear it.
      
      VM_MAYSHARE handling in hugetlb_fault() for FAULT_FLAG_WRITE indicates
      that MAP_SHARED handling was at least envisioned, but could never have
      worked as expected.
      
      While at it, make sure that we never end up in hugetlb_wp() on write
      faults without VM_WRITE, because we don't support maybe_mkwrite()
      semantics as commonly used in the !hugetlb case -- for example, in
      wp_page_reuse().
      
      Note that there is no need to do any kind of reservation in
      hugetlb_fault() in this case ...  because we already have a hugetlb page
      mapped R/O that we will simply map writable and we are not dealing with
      COW/unsharing.
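
      The intended handling can be summarised as a small decision function.
      This is a user-space sketch derived only from the description above;
      the enum, the demo_hugetlb_wp() name and the boolean inputs are
      hypothetical simplifications and not the code in mm/hugetlb.c.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_wp_action {
	DEMO_WP_REJECT,		/* write fault without VM_WRITE: not supported */
	DEMO_WP_MAP_WRITABLE,	/* shared mapping: make the existing PTE writable */
	DEMO_WP_COW,		/* private mapping: usual COW/unsharing path */
};

/* Decide what to do with a write fault on a write-protected hugetlb page. */
static enum demo_wp_action demo_hugetlb_wp(bool vm_write, bool shared)
{
	if (!vm_write)
		return DEMO_WP_REJECT;
	if (shared)
		return DEMO_WP_MAP_WRITABLE;	/* no new page, no reservation */
	return DEMO_WP_COW;
}

/* Usage: the reproducer's MAP_SHARED, PROT_WRITE mapping. */
static void demo_hugetlb_wp_example(void)
{
	assert(demo_hugetlb_wp(true, true) == DEMO_WP_MAP_WRITABLE);
}
```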
      
      Link: https://lkml.kernel.org/r/20220811103435.188481-3-david@redhat.com
      Fixes: b1f9e876 ("mm/uffd: enable write protection for shmem & hugetlbfs")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jamie Liu <jamieliu@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.19]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb: fix hugetlb not supporting softdirty tracking · f96f7a40
      David Hildenbrand authored
      Patch series "mm/hugetlb: fix write-fault handling for shared mappings", v2.
      
      I observed that hugetlb does not support/expect write-faults in shared
      mappings that would have to map the R/O-mapped page writable -- and I
      found two cases where we could currently get such faults and would
      erroneously map an anon page into a shared mapping.
      
      Reproducers are part of the patches.
      
      I propose to backport both fixes to stable trees.  The first fix needs a
      small adjustment.
      
      
      This patch (of 2):
      
      Staring at hugetlb_wp(), one might wonder where all the logic for shared
      mappings is when stumbling over a write-protected page in a shared
      mapping.  In fact, there is none, and so far we thought we could get away
      with that because e.g., mprotect() should always do the right thing and
      map all pages directly writable.
      
      Looks like we were wrong:
      
      --------------------------------------------------------------------------
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <fcntl.h>
       #include <unistd.h>
       #include <errno.h>
       #include <sys/mman.h>
      
       #define HUGETLB_SIZE (2 * 1024 * 1024u)
      
       static void clear_softdirty(void)
       {
               int fd = open("/proc/self/clear_refs", O_WRONLY);
               const char *ctrl = "4";
               int ret;
      
               if (fd < 0) {
                       fprintf(stderr, "open(clear_refs) failed\n");
                       exit(1);
               }
               ret = write(fd, ctrl, strlen(ctrl));
               if (ret != strlen(ctrl)) {
                       fprintf(stderr, "write(clear_refs) failed\n");
                       exit(1);
               }
               close(fd);
       }
      
       int main(int argc, char **argv)
       {
               char *map;
               int fd;
      
               fd = open("/dev/hugepages/tmp", O_RDWR | O_CREAT, 0600);
               if (fd < 0) {
                       fprintf(stderr, "open() failed\n");
                       return -errno;
               }
               if (ftruncate(fd, HUGETLB_SIZE)) {
                       fprintf(stderr, "ftruncate() failed\n");
                       return -errno;
               }
      
               map = mmap(NULL, HUGETLB_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
               if (map == MAP_FAILED) {
                       fprintf(stderr, "mmap() failed\n");
                       return -errno;
               }
      
               *map = 0;
      
               if (mprotect(map, HUGETLB_SIZE, PROT_READ)) {
                       fprintf(stderr, "mprotect() failed\n");
                       return -errno;
               }
      
               clear_softdirty();
      
               if (mprotect(map, HUGETLB_SIZE, PROT_READ|PROT_WRITE)) {
                       fprintf(stderr, "mprotect() failed\n");
                       return -errno;
               }
      
               *map = 0;
      
               return 0;
       }
      --------------------------------------------------------------------------
      
      Above test fails with SIGBUS when there is only a single free hugetlb page.
       # echo 1 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
       # ./test
       Bus error (core dumped)
      
      And worse, with sufficient free hugetlb pages it will map an anonymous page
      into a shared mapping, for example, messing up accounting during unmap
      and breaking MAP_SHARED semantics:
       # echo 2 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
       # ./test
       # cat /proc/meminfo | grep HugePages_
       HugePages_Total:       2
       HugePages_Free:        1
       HugePages_Rsvd:    18446744073709551615
       HugePages_Surp:        0
      
      Reason in this particular case is that vma_wants_writenotify() will
      return "true", removing VM_SHARED in vma_set_page_prot() to map pages
      write-protected. Let's teach vma_wants_writenotify() that hugetlb does not
      support softdirty tracking.
      
      Link: https://lkml.kernel.org/r/20220811103435.188481-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20220811103435.188481-2-david@redhat.com
      Fixes: 64e45507 ("mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Jamie Liu <jamieliu@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>	[3.18+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/uffd: reset write protection when unregister with wp-mode · f369b07c
      Peter Xu authored
      This patch is motivated by a recent report and fix from David
      Hildenbrand for hugetlb handling of write-protected pages in shared
      mappings [1].

      With the reproducer provided in the commit message of [1], one can
      leverage the uffd-wp lazy-reset of ptes to trigger a hugetlb issue which
      can affect not only the attacker process, but also the whole system.
      
      The lazy-reset mechanism of uffd-wp was used to make unregister faster.
      It assumes that any leftover pgtable entries affect only the process
      itself: the user should be aware of what it does, and nothing outside
      the process should be affected.
      
      But it seems that this is not true, and it can also be utilized to make
      some exploit easier.
      
      So far there's no clue showing that the lazy-reset is important to any
      userfaultfd users because normally the unregister will only happen once
      for a specific range of memory of the lifecycle of the process.
      
      Considering all of the above, this patch proposes explicit pte resets
      when unregistering a uffd region with wr-protect mode enabled.
      
      It should be the same as calling ioctl(UFFDIO_WRITEPROTECT, wp=false)
      right before ioctl(UFFDIO_UNREGISTER) for the user.  So potentially it'll
      make the unregister slower.  From that pov it's a very slight abi change,
      but hopefully nothing should break with this change either.
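
      The user-space sequence that this is said to be equivalent to can be
      sketched as follows.  demo_uffd_unprotect_and_unregister() is a
      hypothetical helper name; the ioctls and structures are the ones from
      <linux/userfaultfd.h>, and "wp=false" is expressed by leaving
      UFFDIO_WRITEPROTECT_MODE_WP out of the mode field.  The assert() is a
      sanity check for the sketch only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Remove write protection from [addr, addr + len) and then unregister the
 * range.  Returns 0 on success or a negative errno value on failure.
 */
static int demo_uffd_unprotect_and_unregister(int uffd, void *addr, size_t len)
{
	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long)addr, .len = len },
		.mode = 0,	/* MODE_WP not set: resolve the protection */
	};
	struct uffdio_range range = {
		.start = (unsigned long)addr,
		.len = len,
	};

	assert(len > 0);	/* sketch-only sanity check */

	if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) < 0)
		return -errno;
	if (ioctl(uffd, UFFDIO_UNREGISTER, &range) < 0)
		return -errno;
	return 0;
}
```

      With the patch applied, the first ioctl() becomes redundant for the
      purpose of cleaning up before unregistering, since the kernel performs
      the reset itself.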
      
      Regarding the change itself: the core of the uffd write [un]protect
      operation is moved into a separate function (uffd_wp_range()) and it is
      reused in the unregister code path.
      
      Note that the new function will not check for anything, e.g.  ranges or
      memory types, because they should have been checked during the previous
      UFFDIO_REGISTER or it should have failed already.  It also doesn't check
      mmap_changing because we're with mmap write lock held anyway.
      
      I added a Fixes tag pointing at the introduction of uffd-wp for
      shmem+hugetlbfs because that's the only issue reported so far, and
      that's the commit from which David's reproducer starts working (v5.19+).
      But the whole idea applies not only to file-backed memory but also to
      anonymous memory.  It's just that we don't need to fix anonymous memory
      prior to v5.19 because there's no known way to exploit it.
      
      IOW, this patch can also fix the issue reported in [1], as patch 2 does.
      
      [1] https://lore.kernel.org/all/20220811103435.188481-3-david@redhat.com/
      
      Link: https://lkml.kernel.org/r/20220811201340.39342-1-peterx@redhat.com
      Fixes: b1f9e876 ("mm/uffd: enable write protection for shmem & hugetlbfs")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/smaps: don't access young/dirty bit if pte unpresent · efd41493
      Peter Xu authored
      These bits are only valid when the ptes are present.  Introduce two
      booleans for them and set them to false when !pte_present(), for both
      pte and pmd accounting.
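
      A minimal user-space sketch of the idea, using hypothetical demo_*
      types rather than the kernel's pte helpers: the young and dirty bits
      are read only from present entries, and default to false otherwise.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a page table entry. */
struct demo_pte {
	bool present;
	bool young;	/* meaningless when !present */
	bool dirty;	/* meaningless when !present */
};

/* Hypothetical subset of the smaps counters. */
struct demo_stats {
	unsigned long referenced;
	unsigned long dirty;
};

/* Account one entry of the given size, trusting the bits only if present. */
static void demo_account(struct demo_stats *st, struct demo_pte pte,
			 unsigned long size)
{
	bool young = false, dirty = false;

	if (pte.present) {
		young = pte.young;
		dirty = pte.dirty;
	}
	if (young)
		st->referenced += size;
	if (dirty)
		st->dirty += size;
}

/* Usage: stale bits in a non-present entry must not be counted. */
static void demo_smaps_example(void)
{
	struct demo_stats st = { 0, 0 };
	struct demo_pte swapped = { .present = false, .young = true, .dirty = true };

	demo_account(&st, swapped, 4096);
	assert(st.referenced == 0 && st.dirty == 0);
}
```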
      
      The bug was found during code reading and no real-world issue has been
      reported, but logically such an error can cause incorrect readings in
      either smaps or smaps_rollup output for quite a few fields.

      For example, it could cause over-estimates of values like Shared_Dirty,
      Private_Dirty, Referenced.  Or it could cause under-estimates of values
      like LazyFree, Shared_Clean, Private_Clean.
      
      Link: https://lkml.kernel.org/r/20220805160003.58929-1-peterx@redhat.com
      Fixes: b1d4d9e0 ("proc/smaps: carefully handle migration entries")
      Fixes: c94b6923 ("/proc/PID/smaps: Add PMD migration entry parsing")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add DEVICE_ZONE to FOR_ALL_ZONES · a39c5d3c
      Hao Lee authored
      FOR_ALL_ZONES should be consistent with enum zone_type.  Otherwise,
      __count_zid_vm_events() can add the count to the wrong item when zid is
      ZONE_DEVICE.
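
      The mismatch can be reproduced with a reduced model.  All DEMO_*, B_*
      and F_* names below are hypothetical; only the pattern (one event item
      per zone generated by a macro, then indexed as "base item + zone id")
      is taken from the description above.

```c
#include <assert.h>

/* Reduced stand-in for enum zone_type. */
enum demo_zone { DEMO_ZONE_NORMAL, DEMO_ZONE_MOVABLE, DEMO_ZONE_DEVICE };

/* Per-zone item generators: one without and one with the device zone. */
#define DEMO_ZONES_BUGGY(xx) xx##_NORMAL, xx##_MOVABLE
#define DEMO_ZONES_FIXED(xx) xx##_NORMAL, xx##_MOVABLE, xx##_DEVICE

enum demo_events_buggy {
	DEMO_ZONES_BUGGY(B_PGALLOC),
	DEMO_ZONES_BUGGY(B_ALLOCSTALL),
};

enum demo_events_fixed {
	DEMO_ZONES_FIXED(F_PGALLOC),
	DEMO_ZONES_FIXED(F_ALLOCSTALL),
};

/* Index arithmetic in the style of a per-zone counter: base item + zid. */
#define DEMO_ZID_ITEM(item, zid) \
	((int)item##_NORMAL - (int)DEMO_ZONE_NORMAL + (int)(zid))

/* Usage: the device zone lands in the next counter group unless listed. */
static void demo_zones_example(void)
{
	assert(DEMO_ZID_ITEM(B_PGALLOC, DEMO_ZONE_DEVICE) == B_ALLOCSTALL_NORMAL);
	assert(DEMO_ZID_ITEM(F_PGALLOC, DEMO_ZONE_DEVICE) == F_PGALLOC_DEVICE);
}
```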
      
      Link: https://lkml.kernel.org/r/20220807154442.GA18167@haolee.io
      Signed-off-by: Hao Lee <haolee.swjtu@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kernel/sys_ni: add compat entry for fadvise64_64 · a8faed3a
      Randy Dunlap authored
      When CONFIG_ADVISE_SYSCALLS is not set/enabled and CONFIG_COMPAT is
      set/enabled, the riscv compat_syscall_table references
      'compat_sys_fadvise64_64', which is not defined:
      
      riscv64-linux-ld: arch/riscv/kernel/compat_syscall_table.o:(.rodata+0x6f8):
      undefined reference to `compat_sys_fadvise64_64'
      
      Add 'fadvise64_64' to kernel/sys_ni.c as a conditional COMPAT function so
      that when CONFIG_ADVISE_SYSCALLS is not set, there is a fallback function
      available.
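
      The general idea of such a fallback is a weak symbol that resolves to a
      "not implemented" stub unless a real definition is linked in.  Below is
      a user-space sketch of that idea for GCC or Clang on ELF targets; the
      demo_* names are hypothetical and this is not the macro used in
      kernel/sys_ni.c.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for a generic "syscall not implemented" handler. */
long demo_sys_ni_syscall(void)
{
	return -ENOSYS;
}

/*
 * Weak alias: references to demo_compat_sys_fadvise64_64 resolve to the
 * stub above unless another object file provides a strong definition,
 * so a table that mentions the symbol always links.
 */
long demo_compat_sys_fadvise64_64(void)
	__attribute__((weak, alias("demo_sys_ni_syscall")));

/* Usage: with no strong definition present, the call reports -ENOSYS. */
static void demo_sys_ni_example(void)
{
	assert(demo_compat_sys_fadvise64_64() == -ENOSYS);
}
```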
      
      Link: https://lkml.kernel.org/r/20220807220934.5689-1-rdunlap@infradead.org
      Fixes: d3ac21ca ("mm: Support compiling out madvise and fadvise")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Suggested-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW · 5535be30
      David Hildenbrand authored
      Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know
      that FOLL_FORCE can be possibly dangerous, especially if there are races
      that can be exploited by user space.
      
      Right now, it would be sufficient to have some code that sets a PTE of a
      R/O-mapped shared page dirty, in order for it to erroneously become
      writable by FOLL_FORCE.  The implications of setting a write-protected PTE
      dirty might not be immediately obvious to everyone.
      
      And in fact ever since commit 9ae0f87d ("mm/shmem: unconditionally set
      pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map
      a shmem page R/O while marking the pte dirty.  This can be used by
      unprivileged user space to modify tmpfs/shmem file content even if the
      user does not have write permissions to the file, and to bypass memfd
      write sealing -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590).
      
      To fix such security issues for good, the insight is that we really only
      need that fancy retry logic (FOLL_COW) for COW mappings that are not
      writable (!VM_WRITE).  And in a COW mapping, we really only broke COW if
      we have an exclusive anonymous page mapped.  If we have something else
      mapped, or the mapped anonymous page might be shared (!PageAnonExclusive),
      we have to trigger a write fault to break COW.  If we don't find an
      exclusive anonymous page when we retry, we have to trigger COW breaking
      once again because something intervened.
      
      Let's move away from this mandatory-retry + dirty handling and rely on our
      PageAnonExclusive() flag for making a similar decision, to use the same
      COW logic as in other kernel parts here as well.  In case we stumble over
      a PTE in a COW mapping that does not map an exclusive anonymous page, COW
      was not properly broken and we have to trigger a fake write-fault to break
      COW.
      
      Just like we do in can_change_pte_writable() added via commit 64fe24a3
      ("mm/mprotect: try avoiding write faults for exclusive anonymous pages
      when changing protection") and commit 76aefad6 ("mm/mprotect: fix
      soft-dirty check in can_change_pte_writable()"), take care of softdirty
      and uffd-wp manually.
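
      The decision described in the preceding paragraphs can be written down
      as a small predicate.  This is a user-space model built only from the
      text above: struct demo_gup_state, its fields and
      demo_can_write_without_fault() are hypothetical, and the real check
      operates on kernel PTE, page and VMA state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the state that the check looks at. */
struct demo_gup_state {
	bool pte_writable;	/* PTE already maps the page writable */
	bool foll_force;	/* caller asked for FOLL_FORCE */
	bool cow_mapping;	/* private mapping that can be COWed */
	bool vm_write;		/* VMA has VM_WRITE */
	bool anon_exclusive;	/* an exclusive anonymous page is mapped */
	bool softdirty_pending;	/* write fault needed for softdirty tracking */
	bool uffd_wp;		/* PTE is uffd-wp protected */
};

/* May the page be written through this PTE without a (fake) write fault? */
static bool demo_can_write_without_fault(const struct demo_gup_state *s)
{
	if (s->pte_writable)
		return true;
	if (!s->foll_force)
		return false;
	/* FOLL_FORCE only matters for COW mappings without VM_WRITE. */
	if (!s->cow_mapping || s->vm_write)
		return false;
	/* COW counts as broken only for an exclusive anonymous page. */
	if (!s->anon_exclusive)
		return false;
	/* Softdirty and uffd-wp still require a real fault. */
	return !s->softdirty_pending && !s->uffd_wp;
}

/* Usage: a R/O shmem page with a dirty PTE in a shared mapping. */
static void demo_gup_example(void)
{
	struct demo_gup_state shmem_ro = { .foll_force = true };

	assert(!demo_can_write_without_fault(&shmem_ro));
}
```

      Note that the model has no "PTE is dirty" input at all, matching the
      statement that the dirty bit no longer feeds into the decision.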
      
      For example, a write() via /proc/self/mem to a uffd-wp-protected range has
      to fail instead of silently granting write access and bypassing the
      userspace fault handler.  Note that FOLL_FORCE is not only used for debug
      access, but also triggered by applications without debug intentions, for
      example, when pinning pages via RDMA.
      
      This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are
      affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.
      
      Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So
      let's just get rid of it.
      
      Thanks to Nadav Amit for pointing out that the pte_dirty() check in
      FOLL_FORCE code is problematic and might be exploitable.
      
      Note 1: We don't check for the PTE being dirty because it doesn't matter
      	for making a "was COWed" decision anymore, and whoever modifies the
      	page has to set the page dirty either way.
      
      Note 2: Kernels before extended uffd-wp support and before
      	PageAnonExclusive (< 5.19) can simply revert the problematic
      	commit instead and be safe regarding UFFDIO_CONTINUE. A backport to
      	v5.19 requires minor adjustments due to lack of
      	vma_soft_dirty_enabled().
      
      Link: https://lkml.kernel.org/r/20220809205640.70916-1-david@redhat.com
      Fixes: 9ae0f87d ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: <stable@vger.kernel.org>	[5.16]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5535be30
    • Jiri Slaby's avatar
      Revert "zram: remove double compression logic" · 37887783
      Jiri Slaby authored
      This reverts commit e7be8d1d ("zram: remove double compression
      logic") as it causes zram failures.  It does not revert cleanly because
      PTR_ERR handling was introduced in the meantime; this is handled by an
      appropriate IS_ERR check.
      
      When under memory pressure, zs_malloc() can fail.  Before the above
      commit, the allocation was retried with direct reclaim enabled (GFP_NOIO).
      After the commit, it is not -- only __GFP_KSWAPD_RECLAIM is tried.
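
      The retry pattern restored by this revert can be sketched in
      userspace as follows.  This is a minimal model under stated
      assumptions, not the zram code: alloc_with_fallback(), the
      alloc_mode enum and the fake_pool allocator are hypothetical names
      standing in for zs_malloc() and its GFP flags.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocation modes, loosely mirroring the two GFP choices. */
enum alloc_mode {
	ALLOC_NOWAIT,   /* only wake background reclaim, never block */
	ALLOC_BLOCKING, /* may enter direct reclaim and sleep */
};

/* Hypothetical allocator callback standing in for zs_malloc(). */
typedef void *(*alloc_fn)(size_t size, enum alloc_mode mode, void *ctx);

/*
 * Try the cheap non-blocking allocation first; if it fails, retry once
 * in blocking mode.  *retried reports whether the slow path was taken.
 */
static void *alloc_with_fallback(alloc_fn alloc, size_t size, void *ctx,
				 bool *retried)
{
	void *p = alloc(size, ALLOC_NOWAIT, ctx);

	*retried = false;
	if (p)
		return p;

	*retried = true;
	return alloc(size, ALLOC_BLOCKING, ctx);
}

/* Fake pool: under pressure, only blocking allocations succeed. */
struct fake_pool {
	bool under_pressure;
};

static void *fake_alloc(size_t size, enum alloc_mode mode, void *ctx)
{
	struct fake_pool *pool = ctx;

	if (pool->under_pressure && mode == ALLOC_NOWAIT)
		return NULL;
	return malloc(size);
}
```

      With fake_pool.under_pressure set, the first attempt fails and only
      the blocking retry returns memory; dropping the retry would turn
      that case into an allocation failure.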
      
      So when the failure occurs under memory pressure, the overlaying
      filesystem such as ext2 (mounted by ext4 module in this case) can emit
      failures, making the (file)system unusable:
        EXT4-fs warning (device zram0): ext4_end_bio:343: I/O error 10 writing to inode 16386 starting block 159744)
        Buffer I/O error on device zram0, logical block 159744
      
      With direct reclaim, memory is really reclaimed and the allocation
      eventually succeeds.  In the worst case, the OOM killer is invoked, which
      is the proper outcome if the user sets up a zram device that is too large
      (in comparison to available RAM).
      
      This very diff doesn't apply cleanly to 5.19 (stable) (see the PTR_ERR
      note above).  Use a revert of e7be8d1d directly.
      
      Link: https://bugzilla.suse.com/show_bug.cgi?id=1202203
      Link: https://lkml.kernel.org/r/20220810070609.14402-1-jslaby@suse.cz
      Fixes: e7be8d1d ("zram: remove double compression logic")
      Signed-off-by: Jiri Slaby <jslaby@suse.cz>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Alexey Romanov <avromanov@sberdevices.ru>
      Cc: Dmitry Rokosov <ddrokosov@sberdevices.ru>
      Cc: Lukas Czerner <lczerner@redhat.com>
      Cc: <stable@vger.kernel.org>	[5.19]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      37887783
    • Dan Carpenter's avatar
      get_maintainer: add Alan to .get_maintainer.ignore · d10a72de
      Dan Carpenter authored
      Alan asked to be added to the .get_maintainer.ignore list.
      
      Link: https://lkml.kernel.org/r/YvN30KhO9aD5Sza9@kili
      
      
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d10a72de
  2. Aug 15, 2022
    • Linus Torvalds's avatar
      Linux 6.0-rc1 · 568035b0
      Linus Torvalds authored
      v6.0-rc1
      568035b0
    • Yury Norov's avatar
      radix-tree: replace gfp.h inclusion with gfp_types.h · 9f162193
      Yury Norov authored
      
      
      The radix tree header includes gfp.h only for __GFP_BITS_SHIFT.  Now we
      have gfp_types.h for this.
      
      Fixes powerpc allmodconfig build:
      
         In file included from include/linux/nodemask.h:97,
                          from include/linux/mmzone.h:17,
                          from include/linux/gfp.h:7,
                          from include/linux/radix-tree.h:12,
                          from include/linux/idr.h:15,
                          from include/linux/kernfs.h:12,
                          from include/linux/sysfs.h:16,
                          from include/linux/kobject.h:20,
                          from include/linux/pci.h:35,
                          from arch/powerpc/kernel/prom_init.c:24:
         include/linux/random.h: In function 'add_latent_entropy':
      >> include/linux/random.h:25:46: error: 'latent_entropy' undeclared (first use in this function); did you mean 'add_latent_entropy'?
            25 |         add_device_randomness((const void *)&latent_entropy, sizeof(latent_entropy));
               |                                              ^~~~~~~~~~~~~~
               |                                              add_latent_entropy
         include/linux/random.h:25:46: note: each undeclared identifier is reported only once for each function it appears in
      
      Reported-by: kernel test robot <lkp@intel.com>
      CC: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f162193
    • Linus Torvalds's avatar
      Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 74cbb480
      Linus Torvalds authored
      Pull vfs lseek fix from Al Viro:
       "Fix proc_reg_llseek() breakage. Always had been possible if somebody
        left NULL ->proc_lseek, became a practical issue now"
      
      * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        take care to handle NULL ->proc_lseek()
      74cbb480
    • Al Viro's avatar
      take care to handle NULL ->proc_lseek() · 3f61631d
      Al Viro authored
      Easily done now, just by clearing FMODE_LSEEK in ->f_mode
      during proc_reg_open() for such entries.
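
      A rough userspace model of that idea, for illustration only.  The
      types, the model_ prefixed names and the MODEL_FMODE_LSEEK value are
      hypothetical and do not reproduce the kernel definitions.

```c
#include <stddef.h>

/* Hypothetical flag bit for this model; not taken from kernel headers. */
#define MODEL_FMODE_LSEEK 0x4u

/* Simplified stand-ins for struct proc_ops and struct file. */
struct proc_ops_model {
	long (*proc_lseek)(long offset, int whence);
};

struct file_model {
	unsigned int f_mode;
};

/* Example seek handler so that an entry with ->proc_lseek can be built. */
static long model_seek_noop(long offset, int whence)
{
	(void)whence;
	return offset;
}

/* At open time, mark the file unseekable if the entry has no seek hook. */
static void model_proc_reg_open(struct file_model *file,
				const struct proc_ops_model *ops)
{
	if (!ops->proc_lseek)
		file->f_mode &= ~MODEL_FMODE_LSEEK;
}

/* A later seek path only has to test the mode bit, never the pointer. */
static int model_can_seek(const struct file_model *file)
{
	return (file->f_mode & MODEL_FMODE_LSEEK) != 0;
}
```

      Clearing the bit once at open time means the seek path never has to
      call through a NULL pointer.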
      
      Fixes: 868941b1 "fs: remove no_llseek"
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      3f61631d
    • Linus Torvalds's avatar
      Merge tag 'for-linus-6.0-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · 5d6a0f4d
      Linus Torvalds authored
      Pull more xen updates from Juergen Gross:
      
       - fix the handling of the "persistent grants" feature negotiation
         between Xen blkfront and Xen blkback drivers
      
       - a cleanup of xen.config and adding xen.config to Xen section in
         MAINTAINERS
      
       - support HVMOP_set_evtchn_upcall_vector, which is more compliant to
         "normal" interrupt handling than the global callback used up to now
      
       - further small cleanups
      
      * tag 'for-linus-6.0-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        MAINTAINERS: add xen config fragments to XEN HYPERVISOR sections
        xen: remove XEN_SCRUB_PAGES in xen.config
        xen/pciback: Fix comment typo
        xen/xenbus: fix return type in xenbus_file_read()
        xen-blkfront: Apply 'feature_persistent' parameter when connect
        xen-blkback: Apply 'feature_persistent' parameter when connect
        xen-blkback: fix persistent grants negotiation
        x86/xen: Add support for HVMOP_set_evtchn_upcall_vector
      5d6a0f4d
    • Linus Torvalds's avatar
      Merge tag 'perf-tools-fixes-for-v6.0-2022-08-13' of... · 96f86ff0
      Linus Torvalds authored
      Merge tag 'perf-tools-fixes-for-v6.0-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
      
      Pull more perf tool updates from Arnaldo Carvalho de Melo:
      
       - 'perf c2c' now supports ARM64, adjust its output to cope with
         differences with what is in x86_64. Now go find false sharing on
         ARM64 (at least Neoverse) as well!
      
       - Refactor the JSON processing, making the output more compact and thus
         reducing the size of the resulting perf binary
      
       - Improvements for 'perf offcpu' profiling, including tracking child
         processes
      
       - Update Intel JSON metrics and events files for broadwellde,
         broadwellx, cascadelakex, haswellx, icelakex, ivytown, jaketown,
         knightslanding, sapphirerapids, skylakex and snowridgex
      
       - Add 'perf stat' JSON output and a 'perf test' entry for it
      
       - Ignore memfd and anonymous mmap events if jitdump present
      
       - Refactor 'perf test' shell tests allowing subdirs
      
       - Fix an error handling path in 'parse_perf_probe_command()'
      
       - Fixes for the guest Intel PT tracing patchkit in the 1st batch of
         this merge window
      
       - Print debuginfod queries if -v option is used, to explain delays in
         processing when debuginfo servers are enabled to fetch DSOs with
         richer symbol tables
      
       - Improve error message for 'perf record -p not_existing_pid'
      
       - Fix openssl and libbpf feature detection
      
       - Add PMU pai_crypto event description for IBM z16 on 'perf list'
      
       - Fix typos and duplicated words on comments in various places
      
      * tag 'perf-tools-fixes-for-v6.0-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (81 commits)
        perf test: Refactor shell tests allowing subdirs
        perf vendor events: Update events for snowridgex
        perf vendor events: Update events and metrics for skylakex
        perf vendor events: Update metrics for sapphirerapids
        perf vendor events: Update events for knightslanding
        perf vendor events: Update metrics for jaketown
        perf vendor events: Update metrics for ivytown
        perf vendor events: Update events and metrics for icelakex
        perf vendor events: Update events and metrics for haswellx
        perf vendor events: Update events and metrics for cascadelakex
        perf vendor events: Update events and metrics for broadwellx
        perf vendor events: Update metrics for broadwellde
        perf jevents: Fold strings optimization
        perf jevents: Compress the pmu_events_table
        perf metrics: Copy entire pmu_event in find metric
        perf pmu-events: Hide the pmu_events
        perf pmu-events: Don't assume pmu_event is an array
        perf pmu-events: Move test events/metrics to JSON
        perf test: Use full metric resolution
        perf pmu-events: Hide pmu_events_map
        ...
      96f86ff0
  3. Aug 14, 2022