Jan 27, 2022
• KVM: do not allow mapping valid but non-reference-counted pages · f4b2bfed
  Nicholas Piggin authored
commit f8be156b upstream.
      
It's possible to create a region which maps valid but non-refcounted
pages (e.g., tail pages of non-compound higher order allocations). These
host pages can then be returned by the gfn_to_page, gfn_to_pfn, etc.,
family of APIs, which take a reference to the page, raising its refcount
from 0 to 1. When that reference is dropped, the page is freed
incorrectly.
      
      Fix this by only taking a reference on valid pages if it was non-zero,
      which indicates it is participating in normal refcounting (and can be
      released with put_page).
      
      This addresses CVE-2021-22543.
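
A minimal sketch of the fix (modeled on the upstream commit; the exact
call site in hva_to_pfn_remapped() varies between stable branches):

  static bool kvm_try_get_pfn(kvm_pfn_t pfn)
  {
          if (kvm_is_reserved_pfn(pfn))
                  return true;
          /* Only take a reference if the page already participates in
           * normal refcounting; a 0 -> 1 transition here would let the
           * eventual put_page() free the page incorrectly. */
          return get_page_unless_zero(pfn_to_page(pfn));
  }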
      
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
• KVM: Use kvm_pfn_t for local PFN variable in hva_to_pfn_remapped() · 29efa6b0
  Sean Christopherson authored
      commit a9545779 upstream.
      
Use kvm_pfn_t, a.k.a. u64, for the local 'pfn' variable when retrieving
a so-called "remapped" hva/pfn pair.  In theory, the hva could resolve to
a pfn in high memory on a 32-bit kernel.
      
This bug was inadvertently exposed by commit bd2fae8d ("KVM: do not
assume PTE is writable after follow_pfn"), which added an error PFN value
to the mix, causing gcc to complain about overflowing the unsigned long.
      
        arch/x86/kvm/../../../virt/kvm/kvm_main.c: In function ‘hva_to_pfn_remapped’:
        include/linux/kvm_host.h:89:30: error: conversion from ‘long long unsigned int’
                                        to ‘long unsigned int’ changes value from
                                        ‘9218868437227405314’ to ‘2’ [-Werror=overflow]
         89 | #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
            |                              ^
      virt/kvm/kvm_main.c:1935:9: note: in expansion of macro ‘KVM_PFN_ERR_RO_FAULT’
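
To illustrate the truncation (values as in include/linux/kvm_host.h,
matching the error above; this snippet is editorial, not part of the
patch):

  typedef u64 kvm_pfn_t;
  #define KVM_PFN_ERR_MASK     (0x7ffULL << 52)
  #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)

  kvm_pfn_t pfn = KVM_PFN_ERR_RO_FAULT;  /* fine: 64-bit local */
  /* With "unsigned long pfn" on a 32-bit kernel, the value
   * 0x7ff0000000000002 is silently truncated to 2, which is
   * exactly what gcc reports above. */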
      
      Cc: stable@vger.kernel.org
Fixes: add6a0cd ("KVM: MMU: try to fix up page faults before giving up")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210208201940.1258328-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
• KVM: do not assume PTE is writable after follow_pfn · 854a6e01
  Paolo Bonzini authored
      commit bd2fae8d upstream.
      
In order to convert an HVA to a PFN, KVM usually tries to use
the get_user_pages family of functions.  This however is not
possible for VM_IO vmas; in that case, KVM instead uses follow_pfn.
      
In doing this, however, KVM loses the information on whether the
PFN is writable.  That is usually not a problem because the main
use of VM_IO vmas with KVM is for BARs in PCI device assignment,
but it is a bug nonetheless.  To fix it, use follow_pte and check
pte_write while under the protection of the PTE lock.  The
information can be used to fail hva_to_pfn_remapped or passed back
to the caller via *writable.
      
Usage of follow_pfn was introduced in commit add6a0cd ("KVM: MMU: try to fix
up page faults before giving up", 2016-07-05); however, even older versions
have the same issue, all the way back to commit 2e2e3738 ("KVM:
Handle vma regions with no backing page", 2008-07-20), as they also did
not check whether the PFN was writable.
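
A rough sketch of the fixed lookup (upstream signatures; the 4.9
backport calls follow_pte_pmd() instead, as noted below):

  spinlock_t *ptl;
  pte_t *ptep;
  int r;

  r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
  if (r)
          return r;
  if (write_fault && !pte_write(*ptep)) {
          pte_unmap_unlock(ptep, ptl);   /* fail hva_to_pfn_remapped */
          return -EFAULT;
  }
  if (writable)
          *writable = pte_write(*ptep);  /* report back to the caller */
  pfn = pte_pfn(*ptep);
  pte_unmap_unlock(ptep, ptl);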
      
Fixes: 2e2e3738 ("KVM: Handle vma regions with no backing page")
Reported-by: David Stevens <stevensd@google.com>
Cc: 3pvd@google.com
Cc: Jann Horn <jannh@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[OP: backport to 4.19, adjust follow_pte() -> follow_pte_pmd()]
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[bwh: Backport to 4.9: follow_pte_pmd() does not take start or end
 parameters]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
• mm: add follow_pte_pmd() · 0dd4d649
  Ross Zwisler authored
      commit 09796395 upstream.
      
      Patch series "Write protect DAX PMDs in *sync path".
      
      Currently dax_mapping_entry_mkclean() fails to clean and write protect
      the pmd_t of a DAX PMD entry during an *sync operation.  This can result
      in data loss, as detailed in patch 2.
      
      This series is based on Dan's "libnvdimm-pending" branch, which is the
      current home for Jan's "dax: Page invalidation fixes" series.  You can
      find a working tree here:
      
        https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax_pmd_clean
      
      This patch (of 2):
      
      Similar to follow_pte(), follow_pte_pmd() allows either a PTE leaf or a
      huge page PMD leaf to be found and returned.
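
A hedged usage sketch (signature as in the 4.9 backport, which takes no
start/end parameters; exactly one of the two output pointers is set on
success, and the caller must drop the returned lock):

  pte_t *ptep = NULL;
  pmd_t *pmdp = NULL;
  spinlock_t *ptl;

  if (!follow_pte_pmd(mm, address, &ptep, &pmdp, &ptl)) {
          if (pmdp) {
                  /* found a huge page PMD leaf */
                  spin_unlock(ptl);
          } else {
                  /* found a regular PTE leaf */
                  pte_unmap_unlock(ptep, ptl);
          }
  }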
      
Link: http://lkml.kernel.org/r/1482272586-21177-2-git-send-email-ross.zwisler@linux.intel.com
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[bwh: Backported to 4.9: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
• lib/timerqueue: Rely on rbtree semantics for next timer · ef2e6403
  Davidlohr Bueso authored
commit 511885d7 upstream.
      
Simplify the timerqueue code by using cached rbtrees and relying on the
tree's leftmost-node semantics to get the timer with the earliest
expiration time.  This is a drop-in conversion, and therefore semantics
remain untouched.

The runtime overhead of cached rbtrees is pretty much the same as the
current head->next method, noting that when removing the leftmost node,
a common operation for the timerqueue, rb_next(leftmost) is O(1) as
well, so the next timer will either be the right node or its parent.
Therefore there is no extra pointer chasing.  Finally, the size of
struct timerqueue_head remains the same.
      
      Passes several hours of rcutorture.
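
The heart of the conversion, following the upstream patch:

  struct timerqueue_head {
          struct rb_root_cached rb_root;  /* replaces rb_root + next */
  };

  static inline struct timerqueue_node *
  timerqueue_getnext(struct timerqueue_head *head)
  {
          struct rb_node *leftmost = rb_first_cached(&head->rb_root);

          /* O(1): the cached leftmost node is the earliest timer */
          return rb_entry_safe(leftmost, struct timerqueue_node, node);
  }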
      
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190724152323.bojciei3muvfxalm@linux-r8p5
[bwh: While this was supposed to be just refactoring, it also fixed a
 security flaw (CVE-2021-20317).  Backported to 4.9:
 - Deleted code in timerqueue_del() is different before commit d852d394
   ("timerqueue: Use rb_entry_safe() instead of open-coding it")
 - Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
• rbtree: cache leftmost node internally · c89a7680
  Davidlohr Bueso authored
      commit cd9e61ed upstream.
      
      Patch series "rbtree: Cache leftmost node internally", v4.
      
A series extending rbtrees to internally cache the leftmost node such
that we can have fast overlap check optimization for all interval tree
users[1].  The benefits of this series are that:
      
      (i)   Unify users that do internal leftmost node caching.
      (ii)  Optimize all interval tree users.
      (iii) Convert at least two new users (epoll and procfs) to the new interface.
      
      This patch (of 16):
      
Red-black tree semantics imply that nodes with smaller or greater (or
equal for duplicates) keys always lie to the left and right,
respectively.  For the kernel this is extremely evident when considering
our rb_first() semantics.  Enabling O(1) lookups for the smallest node
in the tree can save a good chunk of cycles by not having to walk down
the tree each time.  To this end there are a few core users that
explicitly do this, such as the scheduler and rtmutexes.  There is also
the desire for interval trees to have this optimization, allowing faster
overlap checking.
      
      This patch introduces a new 'struct rb_root_cached' which is just the
      root with a cached pointer to the leftmost node.  The reason why the
      regular rb_root was not extended instead of adding a new structure was
      that this allows the user to have the choice between memory footprint
      and actual tree performance.  The new wrappers on top of the regular
      rb_root calls are:
      
       - rb_first_cached(cached_root) -- which is a fast replacement
           for rb_first.
      
       - rb_insert_color_cached(node, cached_root, new)
      
       - rb_erase_cached(node, cached_root)
      
In addition, augmented cached interfaces are also added for basic
insertion and deletion operations, which becomes important for the
interval tree changes.

With the exception of the inserts, which add a bool for updating the
new leftmost, the interfaces are kept the same.  To this end, porting
rbtree users to the cached version becomes really trivial, and keeping
current rbtree semantics for users that don't care about the
optimization requires zero overhead.
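
A minimal insertion sketch using the cached interface (the item type
and key are illustrative, not from the patch):

  struct item {
          struct rb_node node;
          u64 key;
  };

  static void item_insert(struct rb_root_cached *root, struct item *it)
  {
          struct rb_node **link = &root->rb_root.rb_node, *parent = NULL;
          bool leftmost = true;

          while (*link) {
                  struct item *cur = rb_entry(*link, struct item, node);

                  parent = *link;
                  if (it->key < cur->key) {
                          link = &parent->rb_left;
                  } else {
                          link = &parent->rb_right;
                          leftmost = false;  /* went right at least once */
                  }
          }
          rb_link_node(&it->node, parent, link);
          rb_insert_color_cached(&it->node, root, leftmost);
  }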
      
Link: http://lkml.kernel.org/r/20170719014603.19029-2-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
• cipso,calipso: resolve a number of problems with the DOI refcounts · f49f0e65
  Paul Moore authored
      commit ad5d07f4 upstream.
      
      The current CIPSO and CALIPSO refcounting scheme for the DOI
      definitions is a bit flawed in that we:
      
      1. Don't correctly match gets/puts in netlbl_cipsov4_list().
      2. Decrement the refcount on each attempt to remove the DOI from the
         DOI list, only removing it from the list once the refcount drops
         to zero.
      
This patch fixes these problems by adding the missing "puts" to
netlbl_cipsov4_list() and introducing a more conventional, i.e.
not-buggy, refcounting mechanism for the DOI definitions.  Upon
addition to the DOI list, a DOI is initialized with a refcount of
one; removing a DOI from the list unlinks it and drops the refcount
by one.  "Gets" and "puts" behave as expected with respect to
refcounts, increasing and decreasing the DOI's refcount by one.
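
In outline (illustrative names only, not the actual netlabel
structures; refcount_t stands in for the driver's counter type):

  /* addition: the DOI list itself holds the initial reference */
  refcount_set(&doi->refcount, 1);
  list_add(&doi->list, &doi_list);

  /* removal: unlink once, then drop the list's reference */
  list_del(&doi->list);
  doi_put(doi);   /* a 1 -> 0 transition frees the definition */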
      
      Fixes: b1edeb10 ("netlabel: Replace protocol/NetLabel linking with refrerence counts")
Fixes: d7cce015 ("netlabel: Add support for removing a CALIPSO DOI.")
Reported-by: <syzbot+9ec037722d2603a9f52e@syzkaller.appspotmail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 4.9: adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
• gianfar: fix jumbo packets+napi+rx overrun crash · 2cf34285
  Michael Braun authored
commit d8861bab upstream.
      
      When using jumbo packets and overrunning rx queue with napi enabled,
      the following sequence is observed in gfar_add_rx_frag:
      
         | lstatus                              |       | skb                   |
      t  | lstatus,  size, flags                | first | len, data_len, *ptr   |
      ---+--------------------------------------+-------+-----------------------+
      13 | 18002348, 9032, INTERRUPT LAST       | 0     | 9600, 8000,  f554c12e |
      12 | 10000640, 1600, INTERRUPT            | 0     | 8000, 6400,  f554c12e |
      11 | 10000640, 1600, INTERRUPT            | 0     | 6400, 4800,  f554c12e |
      10 | 10000640, 1600, INTERRUPT            | 0     | 4800, 3200,  f554c12e |
      09 | 10000640, 1600, INTERRUPT            | 0     | 3200, 1600,  f554c12e |
      08 | 14000640, 1600, INTERRUPT FIRST      | 0     | 1600, 0,     f554c12e |
      07 | 14000640, 1600, INTERRUPT FIRST      | 1     | 0,    0,     f554c12e |
      06 | 1c000080, 128,  INTERRUPT LAST FIRST | 1     | 0,    0,     abf3bd6e |
      05 | 18002348, 9032, INTERRUPT LAST       | 0     | 8000, 6400,  c5a57780 |
      04 | 10000640, 1600, INTERRUPT            | 0     | 6400, 4800,  c5a57780 |
      03 | 10000640, 1600, INTERRUPT            | 0     | 4800, 3200,  c5a57780 |
      02 | 10000640, 1600, INTERRUPT            | 0     | 3200, 1600,  c5a57780 |
      01 | 10000640, 1600, INTERRUPT            | 0     | 1600, 0,     c5a57780 |
      00 | 14000640, 1600, INTERRUPT FIRST      | 1     | 0,    0,     c5a57780 |
      
So at t=7 a new packet is started but not finished, probably due to rx
overrun - but rx overrun is not indicated in the flags. Instead a new
packet starts at t=8. This results in skb->len exceeding size for the
LAST fragment at t=13 and thus a negative fragment size added to the skb.
      
      This then crashes:
      
      kernel BUG at include/linux/skbuff.h:2277!
      Oops: Exception in kernel mode, sig: 5 [#1]
      ...
      NIP [c04689f4] skb_pull+0x2c/0x48
      LR [c03f62ac] gfar_clean_rx_ring+0x2e4/0x844
      Call Trace:
      [ec4bfd38] [c06a84c4] _raw_spin_unlock_irqrestore+0x60/0x7c (unreliable)
      [ec4bfda8] [c03f6a44] gfar_poll_rx_sq+0x48/0xe4
      [ec4bfdc8] [c048d504] __napi_poll+0x54/0x26c
      [ec4bfdf8] [c048d908] net_rx_action+0x138/0x2c0
      [ec4bfe68] [c06a8f34] __do_softirq+0x3a4/0x4fc
      [ec4bfed8] [c0040150] run_ksoftirqd+0x58/0x70
      [ec4bfee8] [c0066ecc] smpboot_thread_fn+0x184/0x1cc
      [ec4bff08] [c0062718] kthread+0x140/0x144
      [ec4bff38] [c0012350] ret_from_kernel_thread+0x14/0x1c
      
This patch fixes this by checking the computed LAST fragment size, so a
negative-sized fragment is never added.
In order to prevent the newer rx frame from getting corrupted, the FIRST
flag is checked to discard the incomplete older frame.
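
A sketch of the two guards, following the upstream patch:

  /* gfar_add_rx_frag(): never add a negative-sized fragment */
  if (lstatus & BD_LFLAG(RXBD_LAST))
          size -= skb->len;

  WARN(size < 0, "gianfar: rx fragment size underflow");
  if (size < 0)
          return false;

  /* gfar_clean_rx_ring(): a FIRST descriptor while an skb is still
   * being assembled means the older frame lost its LAST descriptor
   * to an overrun; discard the incomplete frame */
  if (skb && (lstatus & BD_LFLAG(RXBD_FIRST))) {
          dev_kfree_skb(skb);
          skb = NULL;
          rx_queue->stats.rx_dropped++;
  }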
      
Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
• gianfar: simplify FCS handling and fix memory leak · 8d18509b
  Andy Spencer authored
commit d903ec77 upstream.
      
      Previously, buffer descriptors containing only the frame check sequence
      (FCS) were skipped and not added to the skb. However, the page reference
      count was still incremented, leading to a memory leak.
      
      Fixing this inside gfar_add_rx_frag() is difficult due to reserved
      memory handling and page reuse. Instead, move the FCS handling to
      gfar_process_frame() and trim off the FCS before passing the skb up the
      networking stack.
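
The trim itself is a one-liner in gfar_process_frame() (per the
upstream patch):

  /* Trim off the FCS before handing the skb up the stack */
  pskb_trim(skb, skb->len - ETH_FCS_LEN);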
      
Signed-off-by: Andy Spencer <aspencer@spacex.com>
Signed-off-by: Jim Gruen <jgruen@spacex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
• drm/ttm/nouveau: don't call tt destroy callback on alloc failure. · 70f44dfb
  Dave Airlie authored
commit 5de5b6ec upstream.
      
This is confusing, and from my reading of all the drivers only
nouveau got this right.

Just make the API act under driver control of its own allocation
failing, and don't call destroy: if the page table fails to be
created, there is nothing to clean up here.

(I'm willing to believe I've missed something here, so please
review deeply.)
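
The resulting driver-side pattern looks roughly like this (modeled on
nouveau's sgdma create path; ttm_dma_tt_init()'s signature varies
across kernel versions):

  nvbe = kzalloc(sizeof(*nvbe), GFP_KERNEL);
  if (!nvbe)
          return NULL;

  if (ttm_dma_tt_init(&nvbe->ttm, bo, page_flags)) {
          kfree(nvbe);   /* the driver frees its own allocation... */
          return NULL;   /* ...and no destroy callback is invoked */
  }
  return &nvbe->ttm.ttm;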
      
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20200728041736.20689-1-airlied@gmail.com
      [bwh: Backported to 4.14:
       - Drop change in ttm_sg_tt_init()
       - Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
• gup: document and work around "COW can break either way" issue · 0c29640b
  Linus Torvalds authored
commit 9bbd42e7 upstream.
      
      Doing a "get_user_pages()" on a copy-on-write page for reading can be
      ambiguous: the page can be COW'ed at any time afterwards, and the
      direction of a COW event isn't defined.
      
      Yes, whoever writes to it will generally do the COW, but if the thread
      that did the get_user_pages() unmapped the page before the write (and
      that could happen due to memory pressure in addition to any outright
      action), the writer could also just take over the old page instead.
      
      End result: the get_user_pages() call might result in a page pointer
      that is no longer associated with the original VM, and is associated
      with - and controlled by - another VM having taken it over instead.
      
      So when doing a get_user_pages() on a COW mapping, the only really safe
      thing to do would be to break the COW when getting the page, even when
      only getting it for reading.
      
      At the same time, some users simply don't even care.
      
      For example, the perf code wants to look up the page not because it
      cares about the page, but because the code simply wants to look up the
      physical address of the access for informational purposes, and doesn't
      really care about races when a page might be unmapped and remapped
      elsewhere.
      
      This adds logic to force a COW event by setting FOLL_WRITE on any
      copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
      pointer as a result.
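
The core of that logic is small (sketched from the upstream patch; the
4.9 fast-path hooks differ, as noted at the end):

  /* A COW mapping: private and writable-in-principle */
  static inline bool is_cow_mapping(vm_flags_t flags)
  {
          return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
  }

  static inline bool should_force_cow_break(struct vm_area_struct *vma,
                                            unsigned int flags)
  {
          return is_cow_mapping(vma->vm_flags) && (flags & FOLL_GET);
  }

  /* callers then upgrade the lookup:
   *   if (should_force_cow_break(vma, foll_flags))
   *           foll_flags |= FOLL_WRITE;
   */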
      
      The current semantics end up being:
      
       - __get_user_pages_fast(): no change. If you don't ask for a write,
         you won't break COW. You'd better know what you're doing.
      
       - get_user_pages_fast(): the fast-case "look it up in the page tables
         without anything getting mmap_sem" now refuses to follow a read-only
         page, since it might need COW breaking.  Which happens in the slow
         path - the fast path doesn't know if the memory might be COW or not.
      
       - get_user_pages() (including the slow-path fallback for gup_fast()):
         for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
         very similar semantics to FOLL_FORCE.
      
      If it turns out that we want finer granularity (ie "only break COW when
      it might actually matter" - things like the zero page are special and
      don't need to be broken) we might need to push these semantics deeper
      into the lookup fault path.  So if people care enough, it's possible
      that we might end up adding a new internal FOLL_BREAK_COW flag to go
      with the internal FOLL_COW flag we already have for tracking "I had a
      COW".
      
      Alternatively, if it turns out that different callers might want to
      explicitly control the forced COW break behavior, we might even want to
      make such a flag visible to the users of get_user_pages() instead of
      using the above default semantics.
      
      But for now, this is mostly commentary on the issue (this commit message
      being a lot bigger than the patch, and that patch in turn is almost all
      comments), with that minimal "enable COW breaking early" logic using the
      existing FOLL_WRITE behavior.
      
      [ It might be worth noting that we've always had this ambiguity, and it
        could arguably be seen as a user-space issue.
      
        You only get private COW mappings that could break either way in
        situations where user space is doing cooperative things (ie fork()
        before an execve() etc), but it _is_ surprising and very subtle, and
        fork() is supposed to give you independent address spaces.
      
        So let's treat this as a kernel issue and make the semantics of
        get_user_pages() easier to understand. Note that obviously a true
        shared mapping will still get a page that can change under us, so this
        does _not_ mean that get_user_pages() somehow returns any "stable"
        page ]
      
      [surenb: backport notes
      	Replaced (gup_flags | FOLL_WRITE) with write=1 in gup_pgd_range.
      	Removed FOLL_PIN usage in should_force_cow_break since it's missing in
      	the earlier kernels.]
      
Reported-by: Jann Horn <jannh@google.com>
Tested-by: Christoph Hellwig <hch@lst.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Kirill Shutemov <kirill@shutemov.name>
Acked-by: Jan Kara <jack@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [surenb: backport to 4.19 kernel]
      Cc: stable@vger.kernel.org # 4.19.x
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      [bwh: Backported to 4.9:
       - Generic get_user_pages_fast() calls __get_user_pages_fast() here,
         so make it pass write=1
       - Various architectures have their own implementations of
         get_user_pages_fast(), so apply the corresponding change there
       - Adjust context]
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>