  1. Apr 02, 2020
    • vhost: introduce vDPA-based backend · 4c8cf318
      Tiwei Bie authored
      
      
      This patch introduces a vDPA-based vhost backend. This backend is
      built on top of the same interface defined in virtio-vDPA and provides
      a generic vhost interface for userspace to accelerate the virtio
      devices in the guest.
      
      This backend is implemented as a vDPA device driver on top of the same
      ops used in virtio-vDPA. It creates a char device entry named
      vhost-vdpa-$index for userspace to use. Userspace can use vhost ioctls
      on top of this char device to set up the backend.
      
      Vhost ioctls are extended to make the interface type agnostic and
      behave like a virtio device. This helps to eliminate type specific
      APIs like those of vhost_net/scsi/vsock:
      
      - VHOST_VDPA_GET_DEVICE_ID: get the virtio device ID, which is defined
        by the virtio specification to distinguish between device types
      - VHOST_VDPA_GET_VRING_NUM: get the maximum size of virtqueue
        supported by the vDPA device
      - VHOST_VDPA_SET/GET_STATUS: set and get virtio status of vDPA device
      - VHOST_VDPA_SET/GET_CONFIG: access virtio config space
      - VHOST_VDPA_SET_VRING_ENABLE: enable a specific virtqueue
      
      For memory mapping, the IOTLB API is mandated for vhost-vDPA, which
      means userspace drivers are required to use
      VHOST_IOTLB_UPDATE/VHOST_IOTLB_INVALIDATE to add or remove a mapping
      for a specific userspace memory region.
      
      The vhost-vDPA API is designed to be type agnostic, but it allows only
      net devices at the current stage. Because control virtqueue support is
      lacking, some features are filtered out by vhost-vdpa.
      
      We will enable more features and devices in the near future.
      
      Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
      Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/r/20200326140125.19794-8-jasowang@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    • virtio: introduce a vDPA based transport · c043b4a8
      Jason Wang authored
      
      
      This patch introduces a vDPA transport for virtio. It lets the
      kernel virtio driver drive a vDPA device that is capable of
      populating the virtqueue directly.
      
      A new virtio-vdpa driver is registered to the vDPA bus. When a
      new virtio-vdpa device is probed, the driver registers the device with
      vdpa based config ops. This means it is a software transport between
      the vDPA driver and the vDPA device. The transport is implemented
      through the bus_ops of the vDPA parent.
      
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/r/20200326140125.19794-7-jasowang@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    • vDPA: introduce vDPA bus · 961e9c84
      Jason Wang authored
      
      
      A vDPA device is a device that uses a datapath which complies with the
      virtio specifications and a vendor specific control path. vDPA devices
      can be either physically located on the hardware or emulated by
      software. vDPA hardware devices are usually implemented through PCIE
      with the following types:
      
      - PF (Physical Function) - A single Physical Function
      - VF (Virtual Function) - Device that supports single root I/O
        virtualization (SR-IOV). Its Virtual Function (VF) represents a
        virtualized instance of the device that can be assigned to different
        partitions
      - ADI (Assignable Device Interface) and its equivalents - With
        technologies such as Intel Scalable IOV, a virtual device (VDEV) is
        composed by the host OS utilizing one or more ADIs, or equivalents
        such as SF (Sub function) from Mellanox.
      
      From a driver's perspective, depending on how and where the DMA
      translation is done, vDPA devices are split into two types:
      
      - Platform specific DMA translation - From the driver's perspective,
        the device can be used on a platform where device access to data in
        memory is limited and/or translated. An example is a PCIE vDPA device
        whose DMA requests are tagged in a bus (e.g. PCIE) specific way. DMA
        translation and protection are done at the PCIE bus IOMMU level.
      - Device specific DMA translation - The device implements DMA
        isolation and protection through its own logic. An example is a vDPA
        device which uses an on-chip IOMMU.
      
      To hide the differences and complexity of the above vDPA device
      types and IOMMU options, and to present a generic virtio device
      to the upper layer, a device agnostic framework is required.
      
      This patch introduces a software vDPA bus which abstracts the
      common attributes of the vDPA device, the vDPA bus driver and the
      communication method (vdpa_config_ops) between the vDPA device
      abstraction and the vDPA bus driver. This allows multiple types of
      drivers, such as the virtio_vdpa and vhost_vdpa drivers, to operate
      on the bus, so that a vDPA device can be used by either kernel
      virtio drivers or userspace vhost drivers:
      
         virtio drivers  vhost drivers
                |             |
          [virtio bus]   [vhost uAPI]
                |             |
         virtio device   vhost device
         virtio_vdpa drv vhost_vdpa drv
                   \       /
                  [vDPA bus]
                       |
                  vDPA device
                  hardware drv
                       |
                  [hardware bus]
                       |
                  vDPA hardware
      
      With the abstraction of the vDPA bus and vDPA bus operations, the
      differences and complexity of the underlying hardware are hidden from
      the upper layer. The vDPA bus drivers on top can use a unified
      vdpa_config_ops to control different types of vDPA device.
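
The idea of a unified ops table can be illustrated with a small userspace sketch. Everything named `toy_*` below is hypothetical and heavily reduced; it is not the kernel's `vdpa_config_ops`, only an assumption-laden model of how a bus driver can drive a device through function pointers without knowing the hardware behind them.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily reduced stand-in for an ops table. */
struct toy_vdpa_ops {
	uint32_t (*get_device_id)(void *priv);
	uint8_t (*get_status)(void *priv);
	void (*set_status)(void *priv, uint8_t status);
};

/* A device as seen from the bus: ops plus opaque driver state. */
struct toy_vdpa_dev {
	const struct toy_vdpa_ops *ops;
	void *priv;
};

/* State of one hypothetical "hardware" driver. */
struct toy_hw {
	uint32_t device_id;
	uint8_t status;
};

static uint32_t toy_hw_get_device_id(void *priv)
{
	return ((struct toy_hw *)priv)->device_id;
}

static uint8_t toy_hw_get_status(void *priv)
{
	return ((struct toy_hw *)priv)->status;
}

static void toy_hw_set_status(void *priv, uint8_t status)
{
	((struct toy_hw *)priv)->status = status;
}

static const struct toy_vdpa_ops toy_hw_ops = {
	.get_device_id = toy_hw_get_device_id,
	.get_status = toy_hw_get_status,
	.set_status = toy_hw_set_status,
};

/*
 * Bus-driver side helper: ORs status bits into the device status using
 * only the ops table, then returns the status the device reports back.
 */
static uint8_t toy_bus_driver_add_status(struct toy_vdpa_dev *dev, uint8_t bits)
{
	assert(dev && dev->ops);
	dev->ops->set_status(dev->priv, dev->ops->get_status(dev->priv) | bits);
	return dev->ops->get_status(dev->priv);
}
```

A second hypothetical driver would only need to supply its own `toy_vdpa_ops` instance; `toy_bus_driver_add_status()` would stay unchanged.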
      
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/r/20200326140125.19794-6-jasowang@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    • vringh: IOTLB support · 9ad9c49c
      Jason Wang authored
      
      
      This patch implements a third memory accessor for vringh besides the
      current kernel and userspace accessors. The idea is to allow vringh
      to do the address translation through an IOTLB which is implemented
      via the vhost_map interval tree. Users should set up an IOVA to PA
      mapping in this IOTLB.
      
      This allows us to:
      
      - Use vringh to access virtqueues with vIOMMU
      - Use vringh to implement software virtqueues for vDPA devices
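
A minimal sketch of the IOVA to PA translation step, under stated assumptions: the `toy_*` names are hypothetical, and a linear scan over an array stands in for the interval tree lookup used by the real IOTLB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping entry: IOVA range [start, last] maps to PA addr. */
struct toy_iotlb_map {
	uint64_t start;
	uint64_t last;	/* inclusive upper bound of the IOVA range */
	uint64_t addr;	/* physical address that corresponds to start */
};

/*
 * Translate iova through the table. Returns 0 and stores the result in
 * *pa when a mapping covers iova, or -1 when the address is unmapped.
 */
static int toy_iotlb_translate(const struct toy_iotlb_map *maps, size_t n,
			       uint64_t iova, uint64_t *pa)
{
	assert(pa != NULL);
	for (size_t i = 0; i < n; i++) {
		if (iova >= maps[i].start && iova <= maps[i].last) {
			*pa = maps[i].addr + (iova - maps[i].start);
			return 0;
		}
	}
	return -1;
}
```

An accessor built on such a lookup would translate each descriptor address before touching memory and fail the access when no mapping exists.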
      
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/r/20200326140125.19794-5-jasowang@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    • vhost: factor out IOTLB · 0bbe3066
      Jason Wang authored
      
      
      This patch factors out IOTLB into a dedicated module so that it can be
      reused by other modules like vringh. Users may enable automatic
      retiring by specifying the VHOST_IOTLB_FLAG_RETIRE flag, which fits
      the case of the vhost device IOTLB implementation.
      
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/r/20200326140125.19794-4-jasowang@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    • vhost: allow per device message handler · 792a4f2e
      Jason Wang authored
      
      
      This patch allows a device to register its own message handler during
      vhost_dev_init(). The vDPA device will use it to implement its own DMA
      mapping logic.
      
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/r/20200326140125.19794-3-jasowang@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    • vhost: refine vhost and vringh kconfig · 20c384f1
      Jason Wang authored
      
      
      Currently, CONFIG_VHOST depends on CONFIG_VIRTUALIZATION. But vhost is
      not necessarily for VMs, since it is a generic userspace and kernel
      communication protocol. Such a dependency may prevent archs without
      virtualization support from using vhost.
      
      To solve this, a dedicated vhost menu is created under drivers so
      CONFIG_VHOST can be decoupled from CONFIG_VIRTUALIZATION.
      
      While at it, also squash Kconfig.vringh into vhost Kconfig file. This
      avoids the trick of conditional inclusion from VOP or CAIF. Then it
      will be easier to introduce new vringh users and common dependency for
      both vringh and vhost.
      
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Link: https://lore.kernel.org/r/20200326140125.19794-2-jasowang@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
  2. Mar 23, 2020
    • virtio-balloon: Switch back to OOM handler for VIRTIO_BALLOON_F_DEFLATE_ON_OOM · 5a6b4cc5
      David Hildenbrand authored
      Commit 71994620 ("virtio_balloon: replace oom notifier with shrinker")
      changed the behavior when deflation happens automatically. Instead of
      deflating when called by the OOM handler, the shrinker is used.
      
      However, the balloon is not simply some slab cache that should be
      shrunk when under memory pressure. The shrinker does not have a concept of
      priorities, so this behavior cannot be configured.
      
      There was a report that this results in undesired side effects when
      inflating the balloon to shrink the page cache. [1]
      	"When inflating the balloon against page cache (i.e. no free memory
      	 remains) vmscan.c will both shrink page cache, but also invoke the
      	 shrinkers -- including the balloon's shrinker. So the balloon
      	 driver allocates memory which requires reclaim, vmscan gets this
      	 memory by shrinking the balloon, and then the driver adds the
      	 memory back to the balloon. Basically a busy no-op."
      
      The name "deflate on OOM" makes it pretty clear when deflation should
      happen - after other approaches to reclaim memory failed, not while
      reclaiming. This makes it possible to minimize the footprint of a guest -
      memory will only be taken out of the balloon when really needed.
      
      Especially, a drop_slab() will result in the whole balloon getting
      deflated - undesired. While handling it via the OOM handler might not be
      perfect, it keeps existing behavior. If we want a different behavior, then
      we need a new feature bit and document it properly (although, there should
      be a clear use case and the intended effects should be well described).
      
      Keep using the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT, because
      this has no such side effects. Always register the shrinker with
      VIRTIO_BALLOON_F_FREE_PAGE_HINT now. We are always allowed to reuse free
      pages that are still to be processed by the guest. The hypervisor takes
      care of identifying and resolving possible races between processing a
      hinting request and the guest reusing a page.
      
      In contrast to before commit 71994620 ("virtio_balloon: replace oom
      notifier with shrinker"), don't add a module parameter to configure the
      number of pages to deflate on OOM. It can be re-added if really needed.
      Also, note that leak_balloon() returns the number of 4k pages, so
      convert it properly in virtio_balloon_oom_notify().
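
The unit conversion mentioned above can be sketched as follows. This is an illustration under assumptions, not the driver code: the `toy_*` names are hypothetical, and the only facts taken from the text are that the balloon counts in 4k pages while the OOM notifier reports native pages.

```c
#include <assert.h>

/* Balloon accounting unit is 4 KiB, i.e. a shift of 12 (assumed constant). */
#define TOY_BALLOON_PFN_SHIFT 12

/*
 * Convert a count of 4 KiB balloon pages into native pages of
 * page_size bytes. page_size must be a multiple of 4 KiB.
 */
static unsigned long toy_balloon_to_native_pages(unsigned long pages_4k,
						 unsigned long page_size)
{
	unsigned long per_page = page_size >> TOY_BALLOON_PFN_SHIFT;

	assert(per_page >= 1);
	return pages_4k / per_page;
}
```

On a 4 KiB page system the conversion is the identity; on a 64 KiB page system 256 balloon pages correspond to 16 native pages.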
      
      Note1: using the OOM handler is frowned upon, but it really is what we
             need for this feature.
      
      Note2: without VIRTIO_BALLOON_F_MUST_TELL_HOST (iow, always with QEMU) we
             could actually skip sending deflation requests to our hypervisor,
             making the OOM path *very* simple: basically just freeing pages
             and updating the balloon. That remains an option if the
             communication with the host ever becomes a problem on this call
             path.
      
      [1] https://www.spinics.net/lists/linux-virtualization/msg40863.html
      
      Test report by Tyler Sanderson:
      
      Test setup: VM with 16 CPU, 64GB RAM. Running Debian 10. We have a 42
      GB file full of random bytes that we continually cat to /dev/null.
      This fills the page cache as the file is read. Meanwhile we trigger
      the balloon to inflate, with a target size of 53 GB. This setup causes
      the balloon inflation to pressure the page cache as the page cache is
      also trying to grow. Afterwards we shrink the balloon back to zero (so
      total deflate = total inflate).
      
      Without patch (kernel 4.19.0-5):
      Inflation never reaches the target until we stop the "cat file >
      /dev/null" process. Total inflation time was 542 seconds. The longest
      period that made no net forward progress was 315 seconds (see attached
      graph).
      Result of "grep balloon /proc/vmstat" after the test:
      balloon_inflate 154828377
      balloon_deflate 154828377
      
      With patch (kernel 5.6.0-rc4+):
      Total inflation duration was 63 seconds. No deflate-queue activity
      occurs when pressuring the page-cache.
      Result of "grep balloon /proc/vmstat" after the test:
      balloon_inflate 12968539
      balloon_deflate 12968539
      
      Conclusion: This patch fixes the issue. In the test it reduced
      inflate/deflate activity by 12x, and reduced inflation time by 8.6x.
      But more importantly, if we hadn't killed the "cat file >
      /dev/null" process then, without the patch, the inflation process
      would never reach the target.
      
      Attached [1] is a png of a graph showing the problematic behavior without
      this patch. It shows deflate-queue activity increasing linearly while
      balloon size stays constant over the course of more than 8 minutes of
      the test.
      
      [1] https://lore.kernel.org/linux-mm/CAJuQAmphPcfew1v_EOgAdSFiprzjiZjmOf3iJDmFX0gD6b9TYQ@mail.gmail.com/2-without_patch.png
      
      Full test report and discussion [2]:
      
      [2] https://lore.kernel.org/r/CAJuQAmphPcfew1v_EOgAdSFiprzjiZjmOf3iJDmFX0gD6b9TYQ@mail.gmail.com
      
      
      
      Tested-by: Tyler Sanderson <tysand@google.com>
      Reported-by: Tyler Sanderson <tysand@google.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200205163402.42627-4-david@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    • virtio-net: Introduce hash report feature · 3024e209
      Yuri Benditovich authored
      
      
      The feature VIRTIO_NET_F_HASH_REPORT extends the
      layout of the packet and requests the device to
      calculate hash on incoming packets and report it
      in the packet header.
      
      Signed-off-by: Yuri Benditovich <yuri.benditovich@daynix.com>
      Link: https://lore.kernel.org/r/20200302115003.14877-4-yuri.benditovich@daynix.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    • virtio-net: Introduce RSS receive steering feature · fd58bf67
      Yuri Benditovich authored
      
      
      RSS (Receive-side scaling) defines the hash calculation
      rules and the choice of receive virtqueue according to
      the calculated hash, a provided mask to apply and a
      provided indirection table containing indices of
      receive virtqueues. The driver sends the control
      command to enable multiqueue and provide parameters
      for receive steering.
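
The steering decision described above (hash, mask, indirection table) can be sketched like this. The function and parameter names are hypothetical; the hash calculation itself is left out and the hash is taken as an input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Pick a receive virtqueue index: mask the packet hash, then use the
 * result as an index into the indirection table of queue indices.
 */
static uint16_t toy_rss_select_queue(uint32_t hash, uint16_t mask,
				     const uint16_t *table, size_t table_len)
{
	size_t idx = hash & mask;

	assert(table != NULL && idx < table_len);
	return table[idx];
}
```

With an eight-entry table and a mask of 7, only the low three bits of the hash decide the queue, so flows with equal hashes always land on the same virtqueue.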
      
      Signed-off-by: Yuri Benditovich <yuri.benditovich@daynix.com>
      Link: https://lore.kernel.org/r/20200302115003.14877-3-yuri.benditovich@daynix.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    • virtio-net: Introduce extended RSC feature · 22b436c9
      Yuri Benditovich authored
      
      
      The VIRTIO_NET_F_RSC_EXT feature bit indicates that the device
      is able to provide extended RSC information. When the feature
      is negotiated and the 'gso_type' field in a received packet is not
      GSO_NONE, the device reports the number of coalesced packets in the
      'csum_start' field and the number of duplicated acks in the
      'csum_offset' field, and sets VIRTIO_NET_HDR_F_RSC_INFO in the
      'flags' field.
      
      Signed-off-by: Yuri Benditovich <yuri.benditovich@daynix.com>
      Link: https://lore.kernel.org/r/20200302115003.14877-2-yuri.benditovich@daynix.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    • tools/virtio: option to build an out of tree module · d5f5ee2a
      Michael S. Tsirkin authored
      
      
      Handy for testing with distro kernels.
      Warn that the resulting module is completely unsupported,
      and isn't intended for production use.
      
      Usage:
              make oot # builds vhost_test.ko, vhost.ko
              make oot-clean # cleans out files created
      
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    • Linux 5.6-rc7 · 16fbf79b
      Linus Torvalds authored
    • Merge tag 'for-5.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 67d584e3
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
       "Two fixes.
      
        The first is a regression: when dropping some incompat bits the
        conditions were reversed. The other is a fix for rename whiteout
        potentially leaving stack memory linked to a list"
      
      * tag 'for-5.6-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: fix removal of raid[56|1c34} incompat flags after removing block group
        btrfs: fix log context list corruption after rename whiteout error
    • Merge branch 'akpm' (patches from Andrew) · b3c03db6
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "10 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        x86/mm: split vmalloc_sync_all()
        mm, slub: prevent kmalloc_node crashes and memory leaks
        mm/mmu_notifier: silence PROVE_RCU_LIST warnings
        epoll: fix possible lost wakeup on epoll_ctl() path
        mm: do not allow MADV_PAGEOUT for CoW pages
        mm, memcg: throttle allocators based on ancestral memory.high
        mm, memcg: fix corruption on 64-bit divisor in memory.high throttling
        page-flags: fix a crash at SetPageError(THP_SWAP)
        mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case
        memcg: fix NULL pointer dereference in __mem_cgroup_usage_unregister_event
  3. Mar 22, 2020
    • x86/mm: split vmalloc_sync_all() · 763802b5
      Joerg Roedel authored
      Commit 3f8fd02b ("mm/vmalloc: Sync unmappings in
      __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in
      the vunmap() code-path.  While this change was necessary to maintain
      correctness on x86-32-pae kernels, it also adds additional cycles for
      architectures that don't need it.
      
      Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
      severe performance regressions in micro-benchmarks because it now also
      calls the x86-64 implementation of vmalloc_sync_all() on vunmap().  But
      the vmalloc_sync_all() implementation on x86-64 is only needed for newly
      created mappings.
      
      To avoid the unnecessary work on x86-64 and to gain the performance
      back, split up vmalloc_sync_all() into two functions:
      
      	* vmalloc_sync_mappings(), and
      	* vmalloc_sync_unmappings()
      
      Most call-sites to vmalloc_sync_all() only care about new mappings being
      synchronized.  The only exception is the new call-site added in the
      above mentioned commit.
      
      Shile Zhang directed us to a report of an 80% regression in reaim
      throughput.
      
      Fixes: 3f8fd02b ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Borislav Petkov <bp@suse.de>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	[GHES]
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
      Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
      Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, slub: prevent kmalloc_node crashes and memory leaks · 0715e6c5
      Vlastimil Babka authored
      Sachin reports [1] a crash in SLUB __slab_alloc():
      
        BUG: Kernel NULL pointer dereference on read at 0x000073b0
        Faulting instruction address: 0xc0000000003d55f4
        Oops: Kernel access of bad area, sig: 11 [#1]
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in:
        CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1
        NIP:  c0000000003d55f4 LR: c0000000003d5b94 CTR: 0000000000000000
        REGS: c0000008b37836d0 TRAP: 0300   Not tainted  (5.6.0-rc2-next-20200218-autotest)
        MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24004844  XER: 00000000
        CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1
        GPR00: c0000000003d5b94 c0000008b3783960 c00000000155d400 c0000008b301f500
        GPR04: 0000000000000dc0 0000000000000002 c0000000003443d8 c0000008bb398620
        GPR08: 00000008ba2f0000 0000000000000001 0000000000000000 0000000000000000
        GPR12: 0000000024004844 c00000001ec52a00 0000000000000000 0000000000000000
        GPR16: c0000008a1b20048 c000000001595898 c000000001750c18 0000000000000002
        GPR20: c000000001750c28 c000000001624470 0000000fffffffe0 5deadbeef0000122
        GPR24: 0000000000000001 0000000000000dc0 0000000000000002 c0000000003443d8
        GPR28: c0000008b301f500 c0000008bb398620 0000000000000000 c00c000002287180
        NIP ___slab_alloc+0x1f4/0x760
        LR __slab_alloc+0x34/0x60
        Call Trace:
          ___slab_alloc+0x334/0x760 (unreliable)
          __slab_alloc+0x34/0x60
          __kmalloc_node+0x110/0x490
          kvmalloc_node+0x58/0x110
          mem_cgroup_css_online+0x108/0x270
          online_css+0x48/0xd0
          cgroup_apply_control_enable+0x2ec/0x4d0
          cgroup_mkdir+0x228/0x5f0
          kernfs_iop_mkdir+0x90/0xf0
          vfs_mkdir+0x110/0x230
          do_mkdirat+0xb0/0x1a0
          system_call+0x5c/0x68
      
      This is a PowerPC platform with following NUMA topology:
      
        available: 2 nodes (0-1)
        node 0 cpus:
        node 0 size: 0 MB
        node 0 free: 0 MB
        node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
        node 1 size: 35247 MB
        node 1 free: 30907 MB
        node distances:
        node   0   1
          0:  10  40
          1:  40  10
      
        possible numa nodes: 0-31
      
      This only happens with a mmotm patch "mm/memcontrol.c: allocate
      shrinker_map on appropriate NUMA node" [2] which effectively calls
      kmalloc_node for each possible node.  SLUB however only allocates
      kmem_cache_node on online N_NORMAL_MEMORY nodes, and relies on
      node_to_mem_node to return such valid node for other nodes since commit
      a561ce00 ("slub: fall back to node_to_mem_node() node if allocating
      on memoryless node").  This is however not true in this configuration
      where the _node_numa_mem_ array is not initialized for nodes 0 and 2-31,
      thus it contains zeroes and get_partial() ends up accessing
      non-allocated kmem_cache_node.
      
      A related issue was reported by Bharata (originally by Ramachandran) [3]
      where a similar PowerPC configuration, but with a mainline kernel without
      patch [2], ends up allocating large amounts of pages by kmalloc-1k and
      kmalloc-512.  This seems to have the same underlying issue with
      node_to_mem_node() not behaving as expected, and might also
      lead to an infinite loop with CONFIG_SLUB_CPU_PARTIAL [4].
      
      This patch should fix both issues by not relying on node_to_mem_node()
      anymore and instead simply falling back to NUMA_NO_NODE, when
      kmalloc_node(node) is attempted for a node that's not online, or has no
      usable memory.  The "usable memory" condition is also changed from
      node_present_pages() to N_NORMAL_MEMORY node state, as that is exactly
      the condition that SLUB uses to allocate kmem_cache_node structures.
      The check in get_partial() is removed completely, as the checks in
      ___slab_alloc() are now sufficient to prevent get_partial() being
      reached with an invalid node.
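
The fallback rule described above can be modeled in a few lines. This is a userspace sketch under assumptions, not the SLUB code: `toy_pick_alloc_node()` and the bitmask are hypothetical, with bit n of the mask standing in for "node n is online and in the N_NORMAL_MEMORY state".

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NUMA_NO_NODE (-1)	/* "no node preference" marker */

/*
 * Decide which node an allocation should target. A request for a node
 * without usable normal memory falls back to TOY_NUMA_NO_NODE instead
 * of being redirected to some other specific node.
 */
static int toy_pick_alloc_node(int node, uint32_t normal_memory_mask)
{
	assert(node >= TOY_NUMA_NO_NODE && node < 32);
	if (node != TOY_NUMA_NO_NODE &&
	    !(normal_memory_mask & (UINT32_C(1) << node)))
		return TOY_NUMA_NO_NODE;
	return node;
}
```

For the topology in the report, where only node 1 has memory, a request for node 0 or for any of the possible nodes 2-31 would fall back to "no preference", while a request for node 1 is honored.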
      
      [1] https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
      [2] https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.com/
      [3] https://lore.kernel.org/linux-mm/20200317092624.GB22538@in.ibm.com/
      [4] https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125ae54@suse.cz/
      
      Fixes: a561ce00 ("slub: fall back to node_to_mem_node() node if allocating on memoryless node")
      Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
      Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN <puvichakravarthy@in.ibm.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
      Tested-by: Bharata B Rao <bharata@linux.ibm.com>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200320115533.9604-1-vbabka@suse.cz
      Debugged-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmu_notifier: silence PROVE_RCU_LIST warnings · 63886bad
      Qian Cai authored
      
      
      It is safe to traverse mm->notifier_subscriptions->list either under
      SRCU read lock or mm->notifier_subscriptions->lock using
      hlist_for_each_entry_rcu().  Silence the PROVE_RCU_LIST false positives,
      for example,
      
        WARNING: suspicious RCU usage
        -----------------------------
        mm/mmu_notifier.c:484 RCU-list traversed in non-reader section!!
      
        other info that might help us debug this:
      
        rcu_scheduler_active = 2, debug_locks = 1
        3 locks held by libvirtd/802:
         #0: ffff9321e3f58148 (&mm->mmap_sem#2){++++}, at: do_mprotect_pkey+0xe1/0x3e0
         #1: ffffffff91ae6160 (mmu_notifier_invalidate_range_start){+.+.}, at: change_p4d_range+0x5fa/0x800
         #2: ffffffff91ae6e08 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x178/0x460
      
        stack backtrace:
        CPU: 7 PID: 802 Comm: libvirtd Tainted: G          I       5.6.0-rc6-next-20200317+ #2
        Hardware name: HP ProLiant BL460c Gen8, BIOS I31 11/02/2014
        Call Trace:
          dump_stack+0xa4/0xfe
          lockdep_rcu_suspicious+0xeb/0xf5
          __mmu_notifier_invalidate_range_start+0x3ff/0x460
          change_p4d_range+0x746/0x800
          change_protection+0x1df/0x300
          mprotect_fixup+0x245/0x3e0
          do_mprotect_pkey+0x23b/0x3e0
          __x64_sys_mprotect+0x51/0x70
          do_syscall_64+0x91/0xae8
          entry_SYSCALL_64_after_hwframe+0x49/0xb3
      
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
      Link: http://lkml.kernel.org/r/20200317175640.2047-1-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: fix possible lost wakeup on epoll_ctl() path · 1b53734b
      Roman Penyaev authored
      This fixes a possible lost wakeup introduced by commit a218cc49.
      Originally modifications to ep->wq were serialized by ep->wq.lock, but
      in commit a218cc49 ("epoll: use rwlock in order to reduce
      ep_poll_callback() contention") a new rw lock was introduced in order to
      relax the fd event path, i.e. callers of the ep_poll_callback() function.
      
      After the change ep_modify and ep_insert (both are called on epoll_ctl()
      path) were switched to ep->lock, but ep_poll (epoll_wait) was using
      ep->wq.lock on wqueue list modification.
      
      The bug doesn't lead to any wqueue list corruptions, because wake up
      path and list modifications were serialized by ep->wq.lock internally,
      but actual waitqueue_active() check prior wake_up() call can be
      reordered with modifications of ep ready list, thus wake up can be lost.
      
      And yes, can be healed by explicit smp_mb():
      
        list_add_tail(&epi->rdlink, &ep->rdllist);
        smp_mb();
        if (waitqueue_active(&ep->wq))
      	wake_up(&ep->wq);
      
      But let's keep it simple: the current patch replaces ep->wq.lock with
      ep->lock for wqueue modifications, so the wake up path always observes
      the activeness of the wqueue correctly.
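
      The reordering described above can be replayed with a small user-space
      model. This is only a sketch under stated assumptions: the step names
      and the wakeup_lost() helper are hypothetical and are not kernel code.
      Each step stands for one memory access becoming visible, and the
      function replays a single global order of the four accesses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The four accesses involved, in the order the rest of the system sees them. */
enum step {
	WAKER_PUBLISH,   /* list_add_tail(): event becomes visible on the ready list */
	WAKER_CHECK,     /* waitqueue_active(): waker looks for a queued waiter */
	WAITER_ENQUEUE,  /* waiter adds itself to the wait queue */
	WAITER_CHECK,    /* waiter looks at the ready list before sleeping */
};

/*
 * Replays one global order of the four steps and reports whether the waiter
 * ends up asleep although an event was published and nobody woke it.
 */
bool wakeup_lost(const enum step order[4])
{
	bool ready = false, queued = false;
	bool sleeps = false, woken = false;

	assert(order != NULL);
	for (size_t i = 0; i < 4; i++) {
		switch (order[i]) {
		case WAKER_PUBLISH:
			ready = true;
			break;
		case WAKER_CHECK:
			if (queued)
				woken = true;
			break;
		case WAITER_ENQUEUE:
			queued = true;
			break;
		case WAITER_CHECK:
			if (!ready)
				sleeps = true;
			break;
		}
	}
	return sleeps && !woken;
}
```

      As long as each side keeps its own program order (publish before
      check, enqueue before check), no interleaving in this model loses the
      wakeup. Once the waker's check is allowed to become visible before its
      publish, the order "waker check, waiter enqueue, waiter check, waker
      publish" leaves the waiter asleep, which is the case a barrier or a
      shared lock rules out.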
      
      Fixes: a218cc49 ("epoll: use rwlock in order to reduce ep_poll_callback() contention")
      Reported-by: Max Neunhoeffer <max@arangodb.com>
      Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Max Neunhoeffer <max@arangodb.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Christopher Kohlhoff <chris.kohlhoff@clearpool.io>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Jes Sorensen <jes.sorensen@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.1+]
      Link: http://lkml.kernel.org/r/20200214170211.561524-1-rpenyaev@suse.de
      References: https://bugzilla.kernel.org/show_bug.cgi?id=205933
      Bisected-by: Max Neunhoeffer <max@arangodb.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b53734b
    • Michal Hocko's avatar
      mm: do not allow MADV_PAGEOUT for CoW pages · 12e967fd
      Michal Hocko authored
      Jann has brought up a very interesting point [1].  While shared pages
      are excluded from MADV_PAGEOUT normally, CoW pages can be easily
      reclaimed that way.  This can lead to all sorts of hard to debug
      problems.  E.g.  performance problems outlined by Daniel [2].
      
      There are runtime environments where a substantial amount of memory is
      shared among security domains via CoW, and an easy way to reclaim that
      memory, which MADV_{COLD,PAGEOUT} offers, can lead either to
      performance degradation for the parent process, which might be more
      privileged, or even to side channel attacks.

      The feasibility of the latter is not really clear to me TBH but there is
      no real reason for exposure at this stage.  It seems there is no real
      use case that depends on reclaiming CoW memory via madvise at this stage,
      so it is much easier to simply disallow it, and this is what this patch
      does.  Simply put, MADV_{PAGEOUT,COLD} can operate only on
      exclusively owned memory, which is a straightforward semantic.
      
      [1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com
      [2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com
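
      The "exclusively owned" rule can be pictured with a minimal user-space
      sketch. It assumes that ownership is judged by the number of mappings
      of a page; struct page_model, pageout_allowed() and
      count_reclaimable() are hypothetical names for illustration, not the
      kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a page: only how many mappings reference it. */
struct page_model {
	int mapcount;
};

/* A page is eligible only when exactly one mapping owns it. */
bool pageout_allowed(const struct page_model *page)
{
	assert(page != NULL);
	return page->mapcount == 1;
}

/* Counts the pages in a range the advice would act on, skipping shared ones. */
size_t count_reclaimable(const struct page_model *pages, size_t n)
{
	size_t eligible = 0;

	for (size_t i = 0; i < n; i++)
		if (pageout_allowed(&pages[i]))
			eligible++;
	return eligible;
}
```

      In this model a page still shared between a parent and a child after
      fork() has a mapcount of 2 and is skipped, so neither process can push
      the other's working set out through madvise.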
      
      Fixes: 9c276cc6 ("mm: introduce MADV_COLD")
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200312082248.GS23944@dhcp22.suse.cz
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      12e967fd
    • Chris Down's avatar
      mm, memcg: throttle allocators based on ancestral memory.high · e26733e0
      Chris Down authored
      Prior to this commit, we only directly check the affected cgroup's
      memory.high against its usage.  However, it's possible that we are being
      reclaimed as a result of hitting an ancestor memory.high and should be
      penalised based on that, instead.
      
      This patch changes memory.high overage throttling to use the largest
      overage in its ancestors when considering how many penalty jiffies to
      charge.  This makes sure that we penalise poorly behaving cgroups in the
      same way regardless of at what level of the hierarchy memory.high was
      breached.
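
      The idea of taking the largest overage along the ancestor chain can be
      sketched as follows. This is a simplified user-space model, not the
      patch itself: struct cg_node, the 1024x fixed-point scaling and both
      function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified cgroup node: only the fields this sketch needs. */
struct cg_node {
	uint64_t usage;          /* pages currently charged */
	uint64_t high;           /* memory.high in pages, must be non-zero here */
	struct cg_node *parent;  /* NULL at the root */
};

/* Overage of a single node as a fixed-point ratio scaled by 1024. */
static uint64_t node_overage(const struct cg_node *cg)
{
	assert(cg->high != 0);
	if (cg->usage <= cg->high)
		return 0;
	return ((cg->usage - cg->high) << 10) / cg->high;
}

/* Largest overage on the path from cg (inclusive) up to the root. */
uint64_t max_ancestral_overage(const struct cg_node *cg)
{
	uint64_t max = 0;

	for (; cg != NULL; cg = cg->parent) {
		uint64_t overage = node_overage(cg);

		if (overage > max)
			max = overage;
	}
	return max;
}
```

      With a child that is below its own limit but whose parent is 50% over
      its memory.high, the walk reports the parent's overage, so the child
      would be throttled as if it had breached the limit itself.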
      
      Fixes: 0e4b01df ("mm, memcg: throttle allocators when failing reclaim over memory.high")
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>	[5.4.x+]
      Link: http://lkml.kernel.org/r/8cd132f84bd7e16cdb8fde3378cdbf05ba00d387.1584036142.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e26733e0
    • Chris Down's avatar
      mm, memcg: fix corruption on 64-bit divisor in memory.high throttling · d397a45f
      Chris Down authored
      Commit 0e4b01df had a bunch of fixups to use the right division
      method.  However, it seems that after all that it still wasn't right --
      div_u64 takes a 32-bit divisor.
      
      The headroom is still large (2^32 pages), so on mundane systems you
      won't hit this, but this should definitely be fixed.
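
      The effect of a 32-bit divisor parameter can be reproduced in user
      space. The helpers below are stand-ins written for this sketch
      (div_by_u32() and div_by_u64() are hypothetical names); they only
      mirror the parameter widths, with the divisor value chosen so that its
      low 32 bits are non-zero.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for a helper whose divisor parameter is only 32 bits wide. */
static uint64_t div_by_u32(uint64_t dividend, uint32_t divisor)
{
	assert(divisor != 0);
	return dividend / divisor;
}

/* Stand-in for a helper that accepts a full 64-bit divisor. */
static uint64_t div_by_u64(uint64_t dividend, uint64_t divisor)
{
	assert(divisor != 0);
	return dividend / divisor;
}

/* A divisor just above 2^32: its low 32 bits are 2. */
static const uint64_t BIG_DIVISOR = (UINT64_C(1) << 32) + 2;

uint64_t truncated_result(uint64_t dividend)
{
	/* The conversion to uint32_t drops the high bits, so this divides by 2. */
	return div_by_u32(dividend, (uint32_t)BIG_DIVISOR);
}

uint64_t correct_result(uint64_t dividend)
{
	/* The full 64-bit divisor is used, so small dividends yield 0. */
	return div_by_u64(dividend, BIG_DIVISOR);
}
```

      For a dividend of 100, the truncating path returns 50 while the 64-bit
      path returns 0. In the kernel, div64_u64() is the helper that takes a
      64-bit divisor.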
      
      Fixes: 0e4b01df ("mm, memcg: throttle allocators when failing reclaim over memory.high")
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.4.x+]
      Link: http://lkml.kernel.org/r/80780887060514967d414b3cd91f9a316a16ab98.1584036142.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d397a45f
    • Qian Cai's avatar
      page-flags: fix a crash at SetPageError(THP_SWAP) · d72520ad
      Qian Cai authored
      Commit bd4c82c2 ("mm, THP, swap: delay splitting THP after swapped
      out") supported writing THP to a swap device but forgot to upgrade an
      older commit df8c94d1 ("page-flags: define behavior of FS/IO-related
      flags on compound pages") which could trigger a crash during THP
      swapping out with DEBUG_VM_PGFLAGS=y,
      
        kernel BUG at include/linux/page-flags.h:317!
      
        page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page))
        page:fffff3b2ec3a8000 refcount:512 mapcount:0 mapping:000000009eb0338c index:0x7f6e58200 head:fffff3b2ec3a8000 order:9 compound_mapcount:0 compound_pincount:0
        anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked)
      
        end_swap_bio_write()
          SetPageError(page)
            VM_BUG_ON_PAGE(1 && PageCompound(page))
      
        <IRQ>
        bio_endio+0x297/0x560
        dec_pending+0x218/0x430 [dm_mod]
        clone_endio+0xe4/0x2c0 [dm_mod]
        bio_endio+0x297/0x560
        blk_update_request+0x201/0x920
        scsi_end_request+0x6b/0x4b0
        scsi_io_completion+0x509/0x7e0
        scsi_finish_command+0x1ed/0x2a0
        scsi_softirq_done+0x1c9/0x1d0
        __blk_mqnterrupt+0xf/0x20
        </IRQ>
      
      Fix by checking PF_NO_TAIL in those places instead.
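
      The difference between the two policies can be modelled in a few
      lines. This is a sketch, assuming the debug checks behave as the BUG
      output above suggests: the enum names and set_flag_allowed() are
      hypothetical and only mirror the PF_NO_COMPOUND and PF_NO_TAIL rules
      for flag modification.

```c
#include <assert.h>
#include <stdbool.h>

/* Position of a page inside a (possibly compound) page. */
enum page_kind { PAGE_SINGLE, PAGE_HEAD, PAGE_TAIL };

/* Simplified flag policies, named after the kernel's PF_* macros. */
enum flag_policy { POLICY_NO_COMPOUND, POLICY_NO_TAIL };

/* Returns true when setting the flag on this page would pass the debug check. */
bool set_flag_allowed(enum flag_policy policy, enum page_kind kind)
{
	switch (policy) {
	case POLICY_NO_COMPOUND:
		return kind == PAGE_SINGLE;  /* any compound page trips the check */
	case POLICY_NO_TAIL:
		return kind != PAGE_TAIL;    /* head pages are fine, tails are not */
	}
	assert(!"unknown policy");
	return false;
}
```

      A THP being written to swap is a head page, so it fails under the
      first policy and passes under the second.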
      
      Fixes: bd4c82c2 ("mm, THP, swap: delay splitting THP after swapped out")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200310235846.1319-1-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d72520ad
    • Baoquan He's avatar
      mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case · d41e2f3b
      Baoquan He authored
      In section_deactivate(), pfn_to_page() doesn't work any more after
      ms->section_mem_map is reset to NULL in the SPARSEMEM|!VMEMMAP case.  It
      causes a hot remove failure:
      
        kernel BUG at mm/page_alloc.c:4806!
        invalid opcode: 0000 [#1] SMP PTI
        CPU: 3 PID: 8 Comm: kworker/u16:0 Tainted: G        W         5.5.0-next-20200205+ #340
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
        Workqueue: kacpi_hotplug acpi_hotplug_work_fn
        RIP: 0010:free_pages+0x85/0xa0
        Call Trace:
         __remove_pages+0x99/0xc0
         arch_remove_memory+0x23/0x4d
         try_remove_memory+0xc8/0x130
         __remove_memory+0xa/0x11
         acpi_memory_device_remove+0x72/0x100
         acpi_bus_trim+0x55/0x90
         acpi_device_hotplug+0x2eb/0x3d0
         acpi_hotplug_work_fn+0x1a/0x30
         process_one_work+0x1a7/0x370
         worker_thread+0x30/0x380
         kthread+0x112/0x130
         ret_from_fork+0x35/0x40
      
      Let's move the ->section_mem_map resetting after
      depopulate_section_memmap() to fix it.
      
      [akpm@linux-foundation.org: remove unneeded initialization, per David]
      Fixes: ba72b4c8 ("mm/sparsemem: support sub-section hotplug")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200307084229.28251-2-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d41e2f3b
    • Chunguang Xu's avatar
      memcg: fix NULL pointer dereference in __mem_cgroup_usage_unregister_event · 7d36665a
      Chunguang Xu authored
      When an eventfd that monitors multiple memory thresholds of a cgroup is
      closed, the kernel deletes all events related to this eventfd.  If another
      eventfd starts monitoring a memory threshold of this cgroup before all
      events are deleted, the kernel crashes:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000004
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0002) - not-present page
        PGD 800000033058e067 P4D 800000033058e067 PUD 3355ce067 PMD 0
        Oops: 0002 [#1] SMP PTI
        CPU: 2 PID: 14012 Comm: kworker/2:6 Kdump: loaded Not tainted 5.6.0-rc4 #3
        Hardware name: LENOVO 20AWS01K00/20AWS01K00, BIOS GLET70WW (2.24 ) 05/21/2014
        Workqueue: events memcg_event_remove
        RIP: 0010:__mem_cgroup_usage_unregister_event+0xb3/0x190
        RSP: 0018:ffffb47e01c4fe18 EFLAGS: 00010202
        RAX: 0000000000000001 RBX: ffff8bb223a8a000 RCX: 0000000000000001
        RDX: 0000000000000001 RSI: ffff8bb22fb83540 RDI: 0000000000000001
        RBP: ffffb47e01c4fe48 R08: 0000000000000000 R09: 0000000000000010
        R10: 000000000000000c R11: 071c71c71c71c71c R12: ffff8bb226aba880
        R13: ffff8bb223a8a480 R14: 0000000000000000 R15: 0000000000000000
        FS:  0000000000000000(0000) GS:ffff8bb242680000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000004 CR3: 000000032c29c003 CR4: 00000000001606e0
        Call Trace:
          memcg_event_remove+0x32/0x90
          process_one_work+0x172/0x380
          worker_thread+0x49/0x3f0
          kthread+0xf8/0x130
          ret_from_fork+0x35/0x40
        CR2: 0000000000000004
      
      We can reproduce this problem as follows:

      1. We create a new cgroup subdirectory and a new eventfd, and then we
         monitor multiple memory thresholds of the cgroup through this eventfd.

      2. We close this eventfd, and __mem_cgroup_usage_unregister_event()
         is called multiple times to delete all events related to this
         eventfd.
      
      The first time __mem_cgroup_usage_unregister_event() is called, the
      kernel will clear all items related to this eventfd in
      thresholds->primary.

      Since there is currently only one eventfd, thresholds->primary becomes
      empty, so the kernel will set thresholds->primary and thresholds->spare
      to NULL.  If at this time the user creates a new eventfd and monitors
      the memory threshold of this cgroup, the kernel will re-initialize
      thresholds->primary.

      Then when __mem_cgroup_usage_unregister_event() is called for the
      second time, because thresholds->primary is not empty, the system will
      access thresholds->spare, but thresholds->spare is NULL, which will
      trigger a crash.
      
      In general, the longer it takes to delete all events related to this
      eventfd, the easier it is to trigger this problem.
      
      The solution is to check whether the thresholds associated with the
      eventfd have been cleared when deleting the event.  If so, we do nothing.
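
      The "nothing left to delete, so do nothing" check can be illustrated
      with a reduced user-space model. This is a sketch only: struct
      thresholds_model, struct threshold_entry and unregister_event() are
      hypothetical, and the array handling is far simpler than the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified threshold entry: which eventfd registered it. */
struct threshold_entry {
	int eventfd_id;
};

/* 'primary' is the live array, 'spare' a preallocated buffer for rewrites. */
struct thresholds_model {
	struct threshold_entry *primary;
	size_t size;
	struct threshold_entry *spare;
};

/*
 * Removes every entry registered by eventfd_id. Returns false without
 * touching 'spare' when nothing matches, e.g. because an earlier call
 * already removed all entries of this eventfd.
 */
bool unregister_event(struct thresholds_model *t, int eventfd_id)
{
	size_t matches = 0, kept = 0;
	struct threshold_entry *old;

	assert(t != NULL);
	for (size_t i = 0; i < t->size; i++)
		if (t->primary[i].eventfd_id == eventfd_id)
			matches++;

	if (matches == 0)
		return false;  /* nothing to delete: never dereference spare */

	for (size_t i = 0; i < t->size; i++)
		if (t->primary[i].eventfd_id != eventfd_id)
			t->spare[kept++] = t->primary[i];

	/* Swap roles: the compacted copy becomes the live array. */
	old = t->primary;
	t->primary = t->spare;
	t->spare = old;
	t->size = kept;
	return true;
}
```

      In the second call of the scenario above, the live array holds only
      entries of the new eventfd and the spare buffer is NULL; the early
      return means the spare buffer is never accessed.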
      
      [akpm@linux-foundation.org: fix comment, per Kirill]
      Fixes: 907860ed ("cgroups: make cftype.unregister_event() void-returning")
      Signed-off-by: Chunguang Xu <brookxu@tencent.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/077a6f67-aefa-4591-efec-f2f3af2b0b02@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d36665a
    • Linus Torvalds's avatar
      Merge tag 'block-5.6-20200320' of git://git.kernel.dk/linux-block · b74b991f
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "Just two NVMe fabrics fixes that should go into 5.6"
      
      * tag 'block-5.6-20200320' of git://git.kernel.dk/linux-block:
        nvmet-tcp: set MSG_MORE only if we actually have more to send
        nvme-rdma: Avoid double freeing of async event data
      b74b991f
    • Linus Torvalds's avatar
      Merge tag 'io_uring-5.6-20200320' of git://git.kernel.dk/linux-block · 1ab7ea1f
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
       "Two different fixes in here:
      
         - Fix for a potential NULL pointer deref for links with async or
           drain marked (Pavel)
      
         - Fix for not properly checking RLIMIT_NOFILE for async punted
           operations.
      
           This affects openat/openat2, which were added this cycle, and
           accept4. I did a full audit of other cases where we might check
           current->signal->rlim[] and found only RLIMIT_FSIZE for buffered
           writes and fallocate. That one is fixed and queued for 5.7 and
           marked stable"
      
      * tag 'io_uring-5.6-20200320' of git://git.kernel.dk/linux-block:
        io_uring: make sure accept honor rlimit nofile
        io_uring: make sure openat/openat2 honor rlimit nofile
        io_uring: NULL-deref for IOSQE_{ASYNC,DRAIN}
      1ab7ea1f
    • Linus Torvalds's avatar
      Merge branch 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux · 6c1bae74
      Linus Torvalds authored
      Pull turbostat updates from Len Brown:
       "Update to turbostat v20.03.20.
      
        These patches unlock the full turbostat features for some new
        machines, plus a couple other minor tweaks"
      
      * 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
        tools/power turbostat: update version
        tools/power turbostat: Print cpuidle information
        tools/power turbostat: Fix 32-bit capabilities warning
        tools/power turbostat: Fix missing SYS_LPI counter on some Chromebooks
        tools/power turbostat: Support Elkhart Lake
        tools/power turbostat: Support Jasper Lake
        tools/power turbostat: Support Ice Lake server
        tools/power turbostat: Support Tiger Lake
        tools/power turbostat: Fix gcc build warnings
        tools/power turbostat: Support Cometlake
      6c1bae74
  4. Mar 21, 2020
    • Linus Torvalds's avatar
      Merge tag 'powerpc-5.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · c63c50fc
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
       "Two fixes for bugs introduced this cycle:
      
         - fix a crash when shutting down a KVM PR guest (our original style
           of KVM which doesn't use hypervisor mode)
      
         - fix for the recently added 32-bit KASAN_VMALLOC support
      
        Thanks to: Christophe Leroy, Greg Kurz, Sean Christopherson"
      
      * tag 'powerpc-5.6-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        KVM: PPC: Fix kernel crash with PR KVM
        powerpc/kasan: Fix shadow memory protection with CONFIG_KASAN_VMALLOC
      c63c50fc
    • Len Brown's avatar
      tools/power turbostat: update version · b95fffb9
      Len Brown authored
      
      
      A stitch in time saves nine.
      
      Signed-off-by: Len Brown <len.brown@intel.com>
      b95fffb9
    • Len Brown's avatar
      tools/power turbostat: Print cpuidle information · abdcbdb2
      Len Brown authored
      
      
      Print cpuidle driver and governor.
      
      Originally-by: Antti Laakso <antti.laakso@linux.intel.com>
      Signed-off-by: Len Brown <len.brown@intel.com>
      abdcbdb2
    • Jens Axboe's avatar
      Merge branch 'nvme-5.6-rc6' of git://git.infradead.org/nvme into block-5.6 · 83166ac8
      Jens Axboe authored
      Pull NVMe fixes from Keith:
      
      "Two late nvme fabrics fixes for 5.6: a double free with the rdma
       transport, and a regression fix for tcp; please pull."
      
      * 'nvme-5.6-rc6' of git://git.infradead.org/nvme:
        nvmet-tcp: set MSG_MORE only if we actually have more to send
        nvme-rdma: Avoid double freeing of async event data
      83166ac8
    • Filipe Manana's avatar
      btrfs: fix removal of raid[56|1c34} incompat flags after removing block group · d8e6fd5c
      Filipe Manana authored
      We are incorrectly dropping the raid56 and raid1c34 incompat flags when
      there are still raid56 and raid1c34 block groups, not when we do not have
      any of those anymore. The logic just got unintentionally broken after
      adding the support for the raid1c34 modes.

      Fix this by clearing the flags only if we do not have block groups with
      the respective profiles.
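
      The intended condition can be written down as a tiny user-space
      sketch. The bit values, struct profiles_in_use and
      update_incompat_flags() are hypothetical names chosen for illustration
      and do not correspond to the on-disk btrfs constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values for this sketch; not the on-disk btrfs constants. */
#define INCOMPAT_RAID56    (UINT64_C(1) << 0)
#define INCOMPAT_RAID1C34  (UINT64_C(1) << 1)

/* Which profiles still have at least one block group. */
struct profiles_in_use {
	bool raid56;
	bool raid1c34;
};

/* Clears an incompat bit only when no block group of that profile is left. */
uint64_t update_incompat_flags(uint64_t flags, struct profiles_in_use in_use)
{
	/* This sketch only models the two bits defined above. */
	assert((flags & ~(INCOMPAT_RAID56 | INCOMPAT_RAID1C34)) == 0);

	if (!in_use.raid56)
		flags &= ~INCOMPAT_RAID56;
	if (!in_use.raid1c34)
		flags &= ~INCOMPAT_RAID1C34;
	return flags;
}
```

      The broken behaviour corresponds to the inverted tests (dropping a bit
      while its profile is still in use), which would leave a filesystem
      that still has such block groups without the flag announcing them.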
      
      Fixes: 9c907446 ("btrfs: drop incompat bit for raid1c34 after last block group is gone")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d8e6fd5c
    • Sagi Grimberg's avatar
      nvmet-tcp: set MSG_MORE only if we actually have more to send · 98fd5c72
      Sagi Grimberg authored
      When we send PDU data, we want to optimize the tcp stack
      operation if we have more data to send. So we set MSG_MORE
      when:
      - We have more fragments coming in the batch, or
      - We have more data to send in this PDU
      - We don't have a data digest trailer
      - We optimize with the SUCCESS flag and omit the NVMe completion
        (used if sq_head pointer update is disabled)
      
      This addresses a regression in QD=1 with SUCCESS flag optimization
      as we unconditionally set MSG_MORE when we didn't actually have
      more data to send.
      
      Fixes: 70583295 ("nvmet-tcp: implement C2HData SUCCESS optimization")
      Reported-by: Mark Wunderlich <mark.wunderlich@intel.com>
      Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      98fd5c72
    • Linus Torvalds's avatar
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 5ad0ec0b
      Linus Torvalds authored
      Pull arm64 fixes from Will Deacon:
      
       - Fix panic() when it occurs during secondary CPU startup
      
       - Fix "kpti=off" when KASLR is enabled
      
       - Fix howler in compat syscall table for vDSO clock_getres() fallback
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: compat: Fix syscall number of compat_clock_getres
        arm64: kpti: Fix "kpti=off" when KASLR is enabled
        arm64: smp: fix crash_smp_send_stop() behaviour
        arm64: smp: fix smp_send_stop() behaviour
      5ad0ec0b
    • Linus Torvalds's avatar
      Merge tag 'char-misc-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc · f014d2b8
      Linus Torvalds authored
      Pull char/misc driver fixes from Greg KH:
       "Here are some small different driver fixes for 5.6-rc7:
      
         - binderfs fix, yet again
      
         - slimbus new device id added
      
         - hwtracing bugfixes for reported issues and a new device id
      
        All of these have been in linux-next with no reported issues"
      
      * tag 'char-misc-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
        intel_th: pci: Add Elkhart Lake CPU support
        intel_th: Fix user-visible error codes
        intel_th: msu: Fix the unexpected state warning
        stm class: sys-t: Fix the use of time_after()
        slimbus: ngd: add v2.1.0 compatible
        binderfs: use refcount for binder control devices too
      f014d2b8
    • Linus Torvalds's avatar
      Merge tag 'staging-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging · 3bd14829
      Linus Torvalds authored
      Pull staging/IIO fixes from Greg KH:
       "Here are a number of small staging and IIO driver fixes for 5.6-rc7
      
        Nothing major here, just resolutions for some reported problems:
         - iio bugfixes for a number of different drivers
         - greybus loopback_test fixes
         - wfx driver fixes
      
        All of these have been in linux-next with no reported issues"
      
      * tag 'staging-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
        staging: rtl8188eu: Add device id for MERCUSYS MW150US v2
        staging: greybus: loopback_test: fix potential path truncations
        staging: greybus: loopback_test: fix potential path truncation
        staging: greybus: loopback_test: fix poll-mask build breakage
        staging: wfx: fix RCU usage between hif_join() and ieee80211_bss_get_ie()
        staging: wfx: fix RCU usage in wfx_join_finalize()
        staging: wfx: make warning about pending frame less scary
        staging: wfx: fix lines ending with a comma instead of a semicolon
        staging: wfx: fix warning about freeing in-use mutex during device unregister
        staging/speakup: fix get_word non-space look-ahead
        iio: ping: set pa_laser_ping_cfg in of_ping_match
        iio: chemical: sps30: fix missing triggered buffer dependency
        iio: st_sensors: remap SMO8840 to LIS2DH12
        iio: light: vcnl4000: update sampling periods for vcnl4040
        iio: light: vcnl4000: update sampling periods for vcnl4200
        iio: accel: adxl372: Set iio_chan BE
        iio: magnetometer: ak8974: Fix negative raw values in sysfs
        iio: trigger: stm32-timer: disable master mode when stopping
        iio: adc: stm32-dfsdm: fix sleep in atomic context
        iio: adc: at91-sama5d2_adc: fix differential channels in triggered mode
      3bd14829
    • Linus Torvalds's avatar
      Merge tag 'usb-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb · b07c2e76
      Linus Torvalds authored
      Pull USB fixes from Greg KH:
       "Here are some small USB fixes for 5.6-rc7. And there's a thunderbolt
        driver fix thrown in for good measure as well.
      
        These fixes are:
         - new device ids for usb-serial drivers
         - thunderbolt error code fix
         - xhci driver fixes
         - typec fixes
         - cdc-acm driver fixes
         - chipidea driver fix
         - more USB quirks added for devices that need them.
      
        All of these have been in linux-next with no reported issues"
      
      * tag 'usb-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
        USB: cdc-acm: fix rounding error in TIOCSSERIAL
        USB: cdc-acm: fix close_delay and closing_wait units in TIOCSSERIAL
        usb: quirks: add NO_LPM quirk for RTL8153 based ethernet adapters
        usb: chipidea: udc: fix sleeping function called from invalid context
        USB: serial: pl2303: add device-id for HP LD381
        USB: serial: option: add ME910G1 ECM composition 0x110b
        usb: host: xhci-plat: add a shutdown
        usb: typec: ucsi: displayport: Fix a potential race during registration
        usb: typec: ucsi: displayport: Fix NULL pointer dereference
        USB: Disable LPM on WD19's Realtek Hub
        usb: xhci: apply XHCI_SUSPEND_DELAY to AMD XHCI controller 1022:145c
        xhci: Do not open code __print_symbolic() in xhci trace events
        thunderbolt: Fix error code in tb_port_is_width_supported()
      b07c2e76
    • Linus Torvalds's avatar
      Merge tag 'tty-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty · fa91418b
      Linus Torvalds authored
      Pull tty fixes from Greg KH:
       "Here are three small tty_io bugfixes for reported issues that Eric has
        resolved for 5.6-rc7
      
        All of these have been in linux-next with no reported issues"
      
      * tag 'tty-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
        tty: fix compat TIOCGSERIAL checking wrong function ptr
        tty: fix compat TIOCGSERIAL leaking uninitialized memory
        tty: drop outdated comments about release_tty() locking
      fa91418b
    • Linus Torvalds's avatar
      Merge tag 'sound-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 12bf19c9
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "A few fixes covering the issues reported by syzkaller, a couple of
        fixes for the MIDI decoding bug, and a few usual HD-audio quirks.
      
        Some of them are about ALSA core stuff, but they are small fixes just
        for corner cases, and nothing thrilling"
      
      * tag 'sound-5.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: hda/realtek - Enable the headset of Acer N50-600 with ALC662
        ALSA: hda/realtek - Enable headset mic of Acer X2660G with ALC662
        ALSA: seq: oss: Fix running status after receiving sysex
        ALSA: seq: virmidi: Fix running status after receiving sysex
        ALSA: pcm: oss: Remove WARNING from snd_pcm_plug_alloc() checks
        ALSA: hda/realtek: Fix pop noise on ALC225
        ALSA: line6: Fix endless MIDI read loop
        ALSA: pcm: oss: Avoid plugin buffer overflow
      12bf19c9