  1. Feb 11, 2020
• virtio-balloon: initialize all vq callbacks · ab9377a8
      Daniel Verkamp authored
      commit 5790b533 upstream.
      
      Ensure that elements of the callbacks array that correspond to
      unavailable features are set to NULL; previously, they would be left
      uninitialized.
      
Since the corresponding names array elements were explicitly set to
NULL, the uninitialized callback pointers would not actually be
dereferenced; however, the uninitialized callback array elements would
still be read in vp_find_vqs_msix() and used to calculate the number of
MSI-X vectors required.
      
      Cc: stable@vger.kernel.org
Fixes: 86a55978 ("virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT")
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Daniel Verkamp <dverkamp@chromium.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ab9377a8
• drm/amdgpu/smu_v11_0: Correct behavior of restoring default tables (v2) · 07a04e9f
      Matt Coffin authored
commit 93c5f1f6 upstream.
      
Previously, the sysfs functionality for restoring the default powerplay
table was sourcing its information from the currently-staged powerplay
table.
      
This patch adds a step to cache the first overdrive table that we see on
boot, so that it can be used later to "restore" the powerplay table.

v2: squash my original with Matt's fix
      
      Bug: https://gitlab.freedesktop.org/drm/amd/issues/1020
Signed-off-by: Matt Coffin <mcoffin13@gmail.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 5.5.x
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      07a04e9f
• drm/amdgpu/navi10: add OD_RANGE for navi overclocking · e5f2f739
      Alex Deucher authored
commit ee23a518 upstream.
      
      So users can see the range of valid values.
      
      Bug: https://gitlab.freedesktop.org/drm/amd/issues/1020
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 5.5.x
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e5f2f739
• drm/amdgpu: fetch default VDDC curve voltages (v2) · de755808
      Alex Deucher authored
commit 0531aa6e upstream.
      
      Ask the SMU for the default VDDC curve voltage values.  This
      properly reports the VDDC values in the OD interface.
      
      v2: only update if the original values are 0
      
      Bug: https://gitlab.freedesktop.org/drm/amd/issues/1020
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 5.5.x
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      de755808
• drm/amd/dm/mst: Ignore payload update failures · 41fa3042
      Lyude Paul authored
commit 58fe03d6 upstream.
      
      Disabling a display on MST can potentially happen after the entire MST
      topology has been removed, which means that we can't communicate with
      the topology at all in this scenario. Likewise, this also means that we
      can't properly update payloads on the topology and as such, it's a good
      idea to ignore payload update failures when disabling displays.
      Currently, amdgpu makes the mistake of halting the payload update
      process when any payload update failures occur, resulting in leaving
      DC's local copies of the payload tables out of date.
      
      This ends up causing problems with hotplugging MST topologies, and
      causes modesets on the second hotplug to fail like so:
      
      [drm] Failed to updateMST allocation table forpipe idx:1
      ------------[ cut here ]------------
      WARNING: CPU: 5 PID: 1511 at
      drivers/gpu/drm/amd/amdgpu/../display/dc/core/dc_link.c:2677
      update_mst_stream_alloc_table+0x11e/0x130 [amdgpu]
      Modules linked in: cdc_ether usbnet fuse xt_conntrack nf_conntrack
      nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4
      nft_counter nft_compat nf_tables nfnetlink tun bridge stp llc sunrpc
      vfat fat wmi_bmof uvcvideo snd_hda_codec_realtek snd_hda_codec_generic
      snd_hda_codec_hdmi videobuf2_vmalloc snd_hda_intel videobuf2_memops
      videobuf2_v4l2 snd_intel_dspcfg videobuf2_common crct10dif_pclmul
      snd_hda_codec videodev crc32_pclmul snd_hwdep snd_hda_core
      ghash_clmulni_intel snd_seq mc joydev pcspkr snd_seq_device snd_pcm
      sp5100_tco k10temp i2c_piix4 snd_timer thinkpad_acpi ledtrig_audio snd
      wmi soundcore video i2c_scmi acpi_cpufreq ip_tables amdgpu(O)
      rtsx_pci_sdmmc amd_iommu_v2 gpu_sched mmc_core i2c_algo_bit ttm
      drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec drm
      crc32c_intel serio_raw hid_multitouch r8152 mii nvme r8169 nvme_core
      rtsx_pci pinctrl_amd
      CPU: 5 PID: 1511 Comm: gnome-shell Tainted: G           O      5.5.0-rc7Lyude-Test+ #4
      Hardware name: LENOVO FA495SIT26/FA495SIT26, BIOS R12ET22W(0.22 ) 01/31/2019
      RIP: 0010:update_mst_stream_alloc_table+0x11e/0x130 [amdgpu]
      Code: 28 00 00 00 75 2b 48 8d 65 e0 5b 41 5c 41 5d 41 5e 5d c3 0f b6 06
      49 89 1c 24 41 88 44 24 08 0f b6 46 01 41 88 44 24 09 eb 93 <0f> 0b e9
      2f ff ff ff e8 a6 82 a3 c2 66 0f 1f 44 00 00 0f 1f 44 00
      RSP: 0018:ffffac428127f5b0 EFLAGS: 00010202
      RAX: 0000000000000002 RBX: ffff8d1e166eee80 RCX: 0000000000000000
      RDX: ffffac428127f668 RSI: ffff8d1e166eee80 RDI: ffffac428127f610
      RBP: ffffac428127f640 R08: ffffffffc03d94a8 R09: 0000000000000000
      R10: ffff8d1e24b02000 R11: ffffac428127f5b0 R12: ffff8d1e1b83d000
      R13: ffff8d1e1bea0b08 R14: 0000000000000002 R15: 0000000000000002
      FS:  00007fab23ffcd80(0000) GS:ffff8d1e28b40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f151f1711e8 CR3: 00000005997c0000 CR4: 00000000003406e0
      Call Trace:
       ? mutex_lock+0xe/0x30
       dc_link_allocate_mst_payload+0x9a/0x210 [amdgpu]
       ? dm_read_reg_func+0x39/0xb0 [amdgpu]
       ? core_link_enable_stream+0x656/0x730 [amdgpu]
       core_link_enable_stream+0x656/0x730 [amdgpu]
       dce110_apply_ctx_to_hw+0x58e/0x5d0 [amdgpu]
       ? dcn10_verify_allow_pstate_change_high+0x1d/0x280 [amdgpu]
       ? dcn10_wait_for_mpcc_disconnect+0x3c/0x130 [amdgpu]
       dc_commit_state+0x292/0x770 [amdgpu]
       ? add_timer+0x101/0x1f0
       ? ttm_bo_put+0x1a1/0x2f0 [ttm]
       amdgpu_dm_atomic_commit_tail+0xb59/0x1ff0 [amdgpu]
       ? amdgpu_move_blit.constprop.0+0xb8/0x1f0 [amdgpu]
       ? amdgpu_bo_move+0x16d/0x2b0 [amdgpu]
       ? ttm_bo_handle_move_mem+0x118/0x570 [ttm]
       ? ttm_bo_validate+0x134/0x150 [ttm]
       ? dm_plane_helper_prepare_fb+0x1b9/0x2a0 [amdgpu]
       ? _cond_resched+0x15/0x30
       ? wait_for_completion_timeout+0x38/0x160
       ? _cond_resched+0x15/0x30
       ? wait_for_completion_interruptible+0x33/0x190
       commit_tail+0x94/0x130 [drm_kms_helper]
       drm_atomic_helper_commit+0x113/0x140 [drm_kms_helper]
       drm_atomic_helper_set_config+0x70/0xb0 [drm_kms_helper]
       drm_mode_setcrtc+0x194/0x6a0 [drm]
       ? _cond_resched+0x15/0x30
       ? mutex_lock+0xe/0x30
       ? drm_mode_getcrtc+0x180/0x180 [drm]
       drm_ioctl_kernel+0xaa/0xf0 [drm]
       drm_ioctl+0x208/0x390 [drm]
       ? drm_mode_getcrtc+0x180/0x180 [drm]
       amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
       do_vfs_ioctl+0x458/0x6d0
       ksys_ioctl+0x5e/0x90
       __x64_sys_ioctl+0x16/0x20
       do_syscall_64+0x55/0x1b0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x7fab2121f87b
      Code: 0f 1e fa 48 8b 05 0d 96 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff
      ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01
      f0 ff ff 73 01 c3 48 8b 0d dd 95 2c 00 f7 d8 64 89 01 48
      RSP: 002b:00007ffd045f9068 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      RAX: ffffffffffffffda RBX: 00007ffd045f90a0 RCX: 00007fab2121f87b
      RDX: 00007ffd045f90a0 RSI: 00000000c06864a2 RDI: 000000000000000b
      RBP: 00007ffd045f90a0 R08: 0000000000000000 R09: 000055dbd2985d10
      R10: 000055dbd2196280 R11: 0000000000000246 R12: 00000000c06864a2
      R13: 000000000000000b R14: 0000000000000000 R15: 000055dbd2196280
      ---[ end trace 6ea888c24d2059cd ]---
      
      Note as well, I have only been able to reproduce this on setups with 2
      MST displays.
      
      Changes since v1:
      * Don't return false when part 1 or part 2 of updating the payloads
        fails, we don't want to abort at any step of the process even if
        things fail
      
Reviewed-by: Mikita Lipski <Mikita.Lipski@amd.com>
Signed-off-by: Lyude Paul <lyude@redhat.com>
Acked-by: Harry Wentland <harry.wentland@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      41fa3042
• drm/amd/powerplay: fix navi10 system intermittent reboot issue V2 · 5ce61bb5
      Evan Quan authored
commit 1cf8c930 upstream.
      
      This workaround is needed only for Navi10 12 Gbps SKUs.
      
      V2: added SMU firmware version guard
      
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      5ce61bb5
• drm/tegra: Reuse IOVA mapping where possible · 51df2364
      Thierry Reding authored
commit 273da5a0 upstream.
      
      This partially reverts the DMA API support that was recently merged
      because it was causing performance regressions on older Tegra devices.
      Unfortunately, the cache maintenance performed by dma_map_sg() and
      dma_unmap_sg() causes performance to drop by a factor of 10.
      
      The right solution for this would be to cache mappings for buffers per
      consumer device, but that's a bit involved. Instead, we simply revert to
      the old behaviour of sharing IOVA mappings when we know that devices can
      do so (i.e. they share the same IOMMU domain).
      
      Cc: <stable@vger.kernel.org> # v5.5
Reported-by: Dmitry Osipenko <digetx@gmail.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Tested-by: Dmitry Osipenko <digetx@gmail.com>
Reviewed-by: Dmitry Osipenko <digetx@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      51df2364
• drm/tegra: Relax IOMMU usage criteria on old Tegra · d71c74b1
      Thierry Reding authored
commit 2d9384ff upstream.
      
      Older Tegra devices only allow addressing 32 bits of memory, so whether
      or not the host1x is attached to an IOMMU doesn't matter. host1x IOMMU
      attachment is only needed on devices that can address memory beyond the
      32-bit boundary and where the host1x doesn't support the wide GATHER
      opcode that allows it to access buffers at higher addresses.
      
      Cc: <stable@vger.kernel.org> # v5.5
Signed-off-by: Thierry Reding <treding@nvidia.com>
Tested-by: Dmitry Osipenko <digetx@gmail.com>
Reviewed-by: Dmitry Osipenko <digetx@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d71c74b1
• drm/amdgpu/navi: fix index for OD MCLK · 8420e2c1
      Alex Deucher authored
commit 45826e9c upstream.
      
      You can only adjust the max mclk, not the min.
      
      Bug: https://gitlab.freedesktop.org/drm/amd/issues/1020
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 5.5.x
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8420e2c1
• clk: tegra: Mark fuse clock as critical · d0e381ca
      Stephen Warren authored
commit bf83b96f upstream.
      
      For a little over a year, U-Boot on Tegra124 has configured the flow
      controller to perform automatic RAM re-repair on off->on power
      transitions of the CPU rail[1]. This is mandatory for correct operation
      of Tegra124. However, RAM re-repair relies on certain clocks, which the
      kernel must enable and leave running. The fuse clock is one of those
      clocks. Mark this clock as critical so that LP1 power mode (system
      suspend) operates correctly.
      
      [1] 3cc7942a4ae5 ARM: tegra: implement RAM repair
      
Reported-by: Jonathan Hunter <jonathanh@nvidia.com>
Cc: stable@vger.kernel.org
Signed-off-by: Stephen Warren <swarren@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d0e381ca
• mm/mmu_gather: invalidate TLB correctly on batch allocation failure and flush · fa17a800
      Peter Zijlstra authored
      commit 0ed13259 upstream.
      
Architectures that have hardware walkers of the Linux page table should
flush the TLB on mmu gather batch allocation failures and batch flush.
Some architectures, like POWER, support multiple translation modes (hash
and radix), and in the case of POWER only radix translation mode needs the
above TLBI.  This is because for hash translation mode the kernel wants to
avoid this extra flush, since there are no hardware walkers of the Linux
page table.  With radix translation, the hardware also walks the Linux page
table, and with that the kernel needs to TLB-invalidate the page walk cache
before page table pages are freed.
      
      More details in commit d86564a2 ("mm/tlb, x86/mm: Support invalidating
      TLB caches for RCU_TABLE_FREE")
      
      The changes to sparc are to make sure we keep the old behavior since we
      are now removing HAVE_RCU_TABLE_NO_INVALIDATE.  The default value for
      tlb_needs_table_invalidate is to always force an invalidate and sparc can
      avoid the table invalidate.  Hence we define tlb_needs_table_invalidate to
      false for sparc architecture.
      
      Link: http://lkml.kernel.org/r/20200116064531.483522-3-aneesh.kumar@linux.ibm.com
Fixes: a46cc7a9 ("powerpc/mm/radix: Improve TLB/PWC flushes")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Cc: <stable@vger.kernel.org>	[4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      fa17a800
• arm64: dts: qcom: qcs404-evb: Set vdd_apc regulator in high power mode · 53d35a00
      Niklas Cassel authored
      commit eac8ce86 upstream.
      
      vdd_apc is the regulator that supplies the main CPU cluster.
      
      At sudden CPU load changes, we have noticed invalid page faults on
      addresses with all bits shifted, as well as on addresses with individual
      bits flipped.
      
      By putting the vdd_apc regulator in high power mode, the voltage drops
      during sudden load changes will be less severe, and we have not been able
      to reproduce the invalid page faults with the regulator in this mode.
      
Fixes: 8faea8ed ("arm64: dts: qcom: qcs404-evb: add spmi regulators")
      Cc: stable@vger.kernel.org
Suggested-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Niklas Cassel <niklas.cassel@linaro.org>
Reviewed-by: Vinod Koul <vkoul@kernel.org>
Link: https://lore.kernel.org/r/20191014120920.12691-1-niklas.cassel@linaro.org
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      53d35a00
• mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section · 945afc5b
      David Hildenbrand authored
      commit e822969c upstream.
      
      Patch series "mm: fix max_pfn not falling on section boundary", v2.
      
Playing with different memory sizes for an x86-64 guest, I discovered that
some memmaps (highest section if max_mem does not fall on the section
boundary) are marked as being valid and online, but contain garbage.  We
have to properly initialize these memmaps.
      
      Looking at /proc/kpageflags and friends, I found some more issues,
      partially related to this.
      
      This patch (of 3):
      
      If max_pfn is not aligned to a section boundary, we can easily run into
      BUGs.  This can e.g., be triggered on x86-64 under QEMU by specifying a
      memory size that is not a multiple of 128MB (e.g., 4097MB, but also
      4160MB).  I was told that on real HW, we can easily have this scenario
      (esp., one of the main reasons sub-section hotadd of devmem was added).
      
The issue is that we have a valid memmap (pfn_valid()) for the whole
section, and the whole section will be marked "online".
pfn_to_online_page() will succeed, but the memmap contains garbage.
      
      E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m
      4160M" - (see tools/vm/page-types.c):
      
      [  200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
      [  200.477500] #PF: supervisor read access in kernel mode
      [  200.478334] #PF: error_code(0x0000) - not-present page
      [  200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
      [  200.479557] Oops: 0000 [#4] SMP NOPTI
      [  200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G      D W         5.5.0-rc1-next-20191209 #93
      [  200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
      [  200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
      [  200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
      [  200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
      [  200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
      [  200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
      [  200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
      [  200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
      [  200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
      [  200.487130] FS:  00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
      [  200.487804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
      [  200.488897] Call Trace:
      [  200.489115]  kpageflags_read+0xe9/0x140
      [  200.489447]  proc_reg_read+0x3c/0x60
      [  200.489755]  vfs_read+0xc2/0x170
      [  200.490037]  ksys_pread64+0x65/0xa0
      [  200.490352]  do_syscall_64+0x5c/0xa0
      [  200.490665]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      But it can be triggered much easier via "cat /proc/kpageflags > /dev/null"
      after cold/hot plugging a DIMM to such a system:
      
      [root@localhost ~]# cat /proc/kpageflags > /dev/null
      [  111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
      [  111.517907] #PF: supervisor read access in kernel mode
      [  111.518333] #PF: error_code(0x0000) - not-present page
      [  111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0
      
This patch fixes that by at least zeroing out that memmap (so e.g.,
page_to_pfn() will not crash).  Commit 907ec5fc ("mm: zero remaining
unavailable struct pages") tried to fix a similar issue, but forgot to
consider this special case.
      
      After this patch, there are still problems to solve.  E.g., not all of
      these pages falling into a memory hole will actually get initialized later
      and set PageReserved - they are only zeroed out - but at least the
      immediate crashes are gone.  A follow-up patch will take care of this.
      
      Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com
Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: David Hildenbrand <david@redhat.com>
Tested-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: <stable@vger.kernel.org>	[4.15+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      945afc5b
• ocfs2: fix oops when writing cloned file · ab751a85
      Gang He authored
      commit 2d797e9f upstream.
      
      Writing a cloned file triggers a kernel oops and the user-space command
      process is also killed by the system.  The bug can be reproduced stably
      via:
      
      1) create a file under ocfs2 file system directory.
      
        journalctl -b > aa.txt
      
      2) create a cloned file for this file.
      
        reflink aa.txt bb.txt
      
      3) write the cloned file with dd command.
      
        dd if=/dev/zero of=bb.txt bs=512 count=1 conv=notrunc
      
      The dd command is killed by the kernel, then you can see the oops message
      via dmesg command.
      
      [  463.875404] BUG: kernel NULL pointer dereference, address: 0000000000000028
      [  463.875413] #PF: supervisor read access in kernel mode
      [  463.875416] #PF: error_code(0x0000) - not-present page
      [  463.875418] PGD 0 P4D 0
      [  463.875425] Oops: 0000 [#1] SMP PTI
      [  463.875431] CPU: 1 PID: 2291 Comm: dd Tainted: G           OE     5.3.16-2-default
      [  463.875433] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      [  463.875500] RIP: 0010:ocfs2_refcount_cow+0xa4/0x5d0 [ocfs2]
      [  463.875505] Code: 06 89 6c 24 38 89 eb f6 44 24 3c 02 74 be 49 8b 47 28
      [  463.875508] RSP: 0018:ffffa2cb409dfce8 EFLAGS: 00010202
      [  463.875512] RAX: ffff8b1ebdca8000 RBX: 0000000000000001 RCX: ffff8b1eb73a9df0
      [  463.875515] RDX: 0000000000056a01 RSI: 0000000000000000 RDI: 0000000000000000
      [  463.875517] RBP: 0000000000000001 R08: ffff8b1eb73a9de0 R09: 0000000000000000
      [  463.875520] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
      [  463.875522] R13: ffff8b1eb922f048 R14: 0000000000000000 R15: ffff8b1eb922f048
      [  463.875526] FS:  00007f8f44d15540(0000) GS:ffff8b1ebeb00000(0000) knlGS:0000000000000000
      [  463.875529] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  463.875532] CR2: 0000000000000028 CR3: 000000003c17a000 CR4: 00000000000006e0
      [  463.875546] Call Trace:
      [  463.875596]  ? ocfs2_inode_lock_full_nested+0x18b/0x960 [ocfs2]
      [  463.875648]  ocfs2_file_write_iter+0xaf8/0xc70 [ocfs2]
      [  463.875672]  new_sync_write+0x12d/0x1d0
      [  463.875688]  vfs_write+0xad/0x1a0
      [  463.875697]  ksys_write+0xa1/0xe0
      [  463.875710]  do_syscall_64+0x60/0x1f0
      [  463.875743]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  463.875758] RIP: 0033:0x7f8f4482ed44
      [  463.875762] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 80 00 00 00
      [  463.875765] RSP: 002b:00007fff300a79d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [  463.875769] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8f4482ed44
      [  463.875771] RDX: 0000000000000200 RSI: 000055f771b5c000 RDI: 0000000000000001
      [  463.875774] RBP: 0000000000000200 R08: 00007f8f44af9c78 R09: 0000000000000003
      [  463.875776] R10: 000000000000089f R11: 0000000000000246 R12: 000055f771b5c000
      [  463.875779] R13: 0000000000000200 R14: 0000000000000000 R15: 000055f771b5c000
      
      This regression problem was introduced by commit e74540b2 ("ocfs2:
      protect extent tree in ocfs2_prepare_inode_for_write()").
      
      Link: http://lkml.kernel.org/r/20200121050153.13290-1-ghe@suse.com
Fixes: e74540b2 ("ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()")
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Changwei Ge <gechangwei@live.cn>
      Cc: Jun Piao <piaojun@huawei.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      ab751a85
• KVM: x86: use raw clock values consistently · 3b1cc46f
      Paolo Bonzini authored
      commit 8171cd68 upstream.
      
      Commit 53fafdbb ("KVM: x86: switch KVMCLOCK base to monotonic raw
      clock") changed kvmclock to use tkr_raw instead of tkr_mono.  However,
      the default kvmclock_offset for the VM was still based on the monotonic
      clock and, if the raw clock drifted enough from the monotonic clock,
      this could cause a negative system_time to be written to the guest's
      struct pvclock.  RHEL5 does not like it and (if it boots fast enough to
      observe a negative time value) it hangs.
      
      There is another thing to be careful about: getboottime64 returns the
      host boot time with tkr_mono frequency, and subtracting the tkr_raw-based
      kvmclock value will cause the wallclock to be off if tkr_raw drifts
      from tkr_mono.  To avoid this, compute the wallclock delta from the
      current time instead of being clever and using getboottime64.
      
Fixes: 53fafdbb ("KVM: x86: switch KVMCLOCK base to monotonic raw clock")
      Cc: stable@vger.kernel.org
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      3b1cc46f
• KVM: x86: reorganize pvclock_gtod_data members · 8d617ad2
      Paolo Bonzini authored
      commit 917f9475 upstream.
      
We will need a copy of tk->offs_boot in the next patch.  Store it and
clean up the struct: instead of storing tk->tkr_xxx.base with tk->offs_boot
included, store the raw value in struct pvclock_clock and sum it in
do_monotonic_raw and do_realtime.  tk->tkr_xxx.xtime_nsec also moves
to struct pvclock_clock.
      
      While at it, fix a (usually harmless) typo in do_monotonic_raw, which
      was using gtod->clock.shift instead of gtod->raw_clock.shift.
      
Fixes: 53fafdbb ("KVM: x86: switch KVMCLOCK base to monotonic raw clock")
      Cc: stable@vger.kernel.org
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8d617ad2
• KVM: s390: do not clobber registers during guest reset/store status · 16f536eb
      Christian Borntraeger authored
      commit 55680890 upstream.
      
      The initial CPU reset clobbers the userspace fpc and the store status
      ioctl clobbers the guest acrs + fpr.  As these calls are only done via
      ioctl (and not via vcpu_run), no CPU context is loaded, so we can (and
      must) act directly on the sync regs, not on the thread context.
      
      Cc: stable@kernel.org
      Fixes: e1788bb9 ("KVM: s390: handle floating point registers in the run ioctl not in vcpu_put/load")
Fixes: 31d8b8d4 ("KVM: s390: handle access registers in the run ioctl not in vcpu_put/load")
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
Link: https://lore.kernel.org/r/20200131100205.74720-2-frankja@linux.ibm.com
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      16f536eb
• KVM: x86: Revert "KVM: X86: Fix fpu state crash in kvm guest" · dc887fb2
      Sean Christopherson authored
      commit 2620fe26 upstream.
      
      Reload the current thread's FPU state, which contains the guest's FPU
      state, to the CPU registers if necessary during vcpu_enter_guest().
      TIF_NEED_FPU_LOAD can be set any time control is transferred out of KVM,
      e.g. if I/O is triggered during a KVM call to get_user_pages() or if a
      softirq occurs while KVM is scheduled in.
      
Moving the handling of TIF_NEED_FPU_LOAD from vcpu_enter_guest() to
kvm_arch_vcpu_load(), effectively kvm_sched_in(), papered over a bug
where kvm_put_guest_fpu() failed to account for TIF_NEED_FPU_LOAD.  The
easiest way to trigger the kvm_put_guest_fpu() bug was to run with
involuntary preemption enabled, thus handling TIF_NEED_FPU_LOAD during
kvm_sched_in() made the bug go away.  But removing the handling in
vcpu_enter_guest() exposed KVM to the rare case of a softirq triggering
kernel_fpu_begin() between vcpu_load() and vcpu_enter_guest().
      
      Now that kvm_{load,put}_guest_fpu() correctly handle TIF_NEED_FPU_LOAD,
      revert the commit to both restore the vcpu_enter_guest() behavior and
      eliminate the superfluous switch_fpu_return() in kvm_arch_vcpu_load().
      
      Note, leaving the handling in kvm_arch_vcpu_load() isn't wrong per se,
      but it is unnecessary, and most critically, makes it extremely difficult
      to find bugs such as the kvm_put_guest_fpu() issue due to shrinking the
      window where a softirq can corrupt state.
      
      A sample trace triggered by warning if TIF_NEED_FPU_LOAD is set while
      vcpu state is loaded:
      
       <IRQ>
        gcmaes_crypt_by_sg.constprop.12+0x26e/0x660
        ? 0xffffffffc024547d
        ? __qdisc_run+0x83/0x510
        ? __dev_queue_xmit+0x45e/0x990
        ? ip_finish_output2+0x1a8/0x570
        ? fib4_rule_action+0x61/0x70
        ? fib4_rule_action+0x70/0x70
        ? fib_rules_lookup+0x13f/0x1c0
        ? helper_rfc4106_decrypt+0x82/0xa0
        ? crypto_aead_decrypt+0x40/0x70
        ? crypto_aead_decrypt+0x40/0x70
        ? crypto_aead_decrypt+0x40/0x70
        ? esp_output_tail+0x8f4/0xa5a [esp4]
        ? skb_ext_add+0xd3/0x170
        ? xfrm_input+0x7a6/0x12c0
        ? xfrm4_rcv_encap+0xae/0xd0
        ? xfrm4_transport_finish+0x200/0x200
        ? udp_queue_rcv_one_skb+0x1ba/0x460
        ? udp_unicast_rcv_skb.isra.63+0x72/0x90
        ? __udp4_lib_rcv+0x51b/0xb00
        ? ip_protocol_deliver_rcu+0xd2/0x1c0
        ? ip_local_deliver_finish+0x44/0x50
        ? ip_local_deliver+0xe0/0xf0
        ? ip_protocol_deliver_rcu+0x1c0/0x1c0
        ? ip_rcv+0xbc/0xd0
        ? ip_rcv_finish_core.isra.19+0x380/0x380
        ? __netif_receive_skb_one_core+0x7e/0x90
        ? netif_receive_skb_internal+0x3d/0xb0
        ? napi_gro_receive+0xed/0x150
        ? 0xffffffffc0243c77
        ? net_rx_action+0x149/0x3b0
        ? __do_softirq+0xe4/0x2f8
        ? handle_irq_event_percpu+0x6a/0x80
        ? irq_exit+0xe6/0xf0
        ? do_IRQ+0x7f/0xd0
        ? common_interrupt+0xf/0xf
        </IRQ>
        ? irq_entries_start+0x20/0x660
        ? vmx_get_interrupt_shadow+0x2f0/0x710 [kvm_intel]
        ? kvm_set_msr_common+0xfc7/0x2380 [kvm]
        ? recalibrate_cpu_khz+0x10/0x10
        ? ktime_get+0x3a/0xa0
        ? kvm_arch_vcpu_ioctl_run+0x107/0x560 [kvm]
        ? kvm_init+0x6bf/0xd00 [kvm]
        ? __seccomp_filter+0x7a/0x680
        ? do_vfs_ioctl+0xa4/0x630
        ? security_file_ioctl+0x32/0x50
        ? ksys_ioctl+0x60/0x90
        ? __x64_sys_ioctl+0x16/0x20
        ? do_syscall_64+0x5f/0x1a0
        ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      ---[ end trace 9564a1ccad733a90 ]---
      
      This reverts commit e7517324.
      
      Fixes: e7517324 ("KVM: X86: Fix fpu state crash in kvm guest")
      Reported-by: Derek Yerger <derek@djy.llc>
      Reported-by: <kernel@najdan.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Cc: Thomas Lambertz <mail@thomaslambertz.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86: Ensure guest's FPU state is loaded when accessing for emulation · b24909e9
      Sean Christopherson authored
      commit a7baead7 upstream.
      
      Lock the FPU regs and reload the current thread's FPU state, which holds
      the guest's FPU state, to the CPU registers if necessary prior to
      accessing guest FPU state as part of emulation.  kernel_fpu_begin() can
      be called from softirq context, therefore KVM must ensure softirqs are
      disabled (locking the FPU regs disables softirqs) when touching CPU FPU
      state.
      
      Note, for all intents and purposes this reverts commit 6ab0b9fe
      ("x86,kvm: remove KVM emulator get_fpu / put_fpu"), but at the time it
      was applied, removing get/put_fpu() was correct.  The re-introduction
      of {get,put}_fpu() is necessitated by the deferring of FPU state load.
      
      Fixes: 5f409e20 ("x86/fpu: Defer FPU state load until return to userspace")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86: Handle TIF_NEED_FPU_LOAD in kvm_{load,put}_guest_fpu() · 6d88948e
      Sean Christopherson authored
      commit c9aef3b8 upstream.
      
      Handle TIF_NEED_FPU_LOAD similar to how fpu__copy() handles the flag
      when duplicating FPU state to a new task struct.  TIF_NEED_FPU_LOAD can
      be set any time control is transferred out of KVM, be it voluntarily,
      e.g. if I/O is triggered during a KVM call to get_user_pages, or
      involuntarily, e.g. if softirq runs after an IRQ occurs.  Therefore,
      KVM must account for TIF_NEED_FPU_LOAD whenever it is (potentially)
      accessing CPU FPU state.
      
      Fixes: 5f409e20 ("x86/fpu: Defer FPU state load until return to userspace")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86: fix overlap between SPTE_MMIO_MASK and generation · a8dd6917
      Paolo Bonzini authored
      commit 56871d44 upstream.
      
      The SPTE_MMIO_MASK overlaps with the bits used to track MMIO
      generation number.  A high enough generation number would overwrite the
      SPTE_SPECIAL_MASK region and cause the MMIO SPTE to be misinterpreted.
      
      Likewise, setting bits 52 and 53 would also cause an incorrect generation
      number to be read from the PTE, though this was partially mitigated by the
      (useless if it weren't for the bug) removal of SPTE_SPECIAL_MASK from
      the spte in get_mmio_spte_generation.  Drop that removal, and replace
      it with a compile-time assertion.
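      The shape of that guard is a compile-time check that the generation bit range and the special bit range are disjoint, so the generation can be extracted without first stripping SPTE_SPECIAL_MASK. A minimal sketch; the mask values below are illustrative, not the kernel's actual SPTE layout:

```c
#include <assert.h>
#include <stdint.h>

#define SPTE_SPECIAL_MASK	(3ULL << 52)			/* illustrative: bits 52-53 */
#define MMIO_SPTE_GEN_SHIFT	54
#define MMIO_SPTE_GEN_MASK	(0x3ffULL << MMIO_SPTE_GEN_SHIFT)	/* illustrative: bits 54-63 */

/* Catch any future overlap at build time instead of as a rare
 * misinterpreted MMIO SPTE at run time. */
_Static_assert((MMIO_SPTE_GEN_MASK & SPTE_SPECIAL_MASK) == 0,
	       "MMIO generation bits overlap SPTE_SPECIAL_MASK");

static uint64_t encode_mmio_gen(uint64_t gen)
{
	return (gen << MMIO_SPTE_GEN_SHIFT) & MMIO_SPTE_GEN_MASK;
}

/* Extract the generation; masking with MMIO_SPTE_GEN_MASK is sufficient
 * because the assertion above guarantees the ranges are disjoint. */
static uint64_t get_mmio_spte_generation(uint64_t spte)
{
	return (spte & MMIO_SPTE_GEN_MASK) >> MMIO_SPTE_GEN_SHIFT;
}
```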
      
      Fixes: 6eeb4ef0 ("KVM: x86: assign two bits to track SPTE kinds")
      Reported-by: Ben Gardon <bgardon@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86: Free wbinvd_dirty_mask if vCPU creation fails · eea53c94
      Sean Christopherson authored
      commit 16be9dde upstream.
      
      Free the vCPU's wbinvd_dirty_mask if vCPU creation fails after
      kvm_arch_vcpu_init(), e.g. when installing the vCPU's file descriptor.
      Do the freeing by calling kvm_arch_vcpu_free() instead of open coding
      the freeing.  This adds a likely superfluous, but ultimately harmless,
      call to kvmclock_reset(), which only clears vcpu->arch.pv_time_enabled.
      Using kvm_arch_vcpu_free() allows for additional cleanup in the future.
      
      Fixes: f5f48ee1 ("KVM: VMX: Execute WBINVD to keep data consistency with assigned devices")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86: Don't let userspace set host-reserved cr4 bits · 24655ce0
      Sean Christopherson authored
      commit b11306b5 upstream.
      
      Calculate the host-reserved cr4 bits at runtime based on the system's
      capabilities (using logic similar to __do_cpuid_func()), and use the
      dynamically generated mask for the reserved bit check in kvm_set_cr4()
      instead of using the static CR4_RESERVED_BITS define.  This prevents
      userspace from "enabling" features in cr4 that are not supported by the
      system, e.g. by ignoring KVM_GET_SUPPORTED_CPUID and specifying a bogus
      CPUID for the vCPU.
      
      Allowing userspace to set unsupported bits in cr4 can lead to a variety
      of undesirable behavior, e.g. failed VM-Enter, and in general increases
      KVM's attack surface.  A crafty userspace can even abuse CR4.LA57 to
      induce an unchecked #GP on a WRMSR.
      
      On a platform without LA57 support:
      
        KVM_SET_CPUID2 // CPUID_7_0_ECX.LA57 = 1
        KVM_SET_SREGS  // CR4.LA57 = 1
        KVM_SET_MSRS   // KERNEL_GS_BASE = 0x0004000000000000
        KVM_RUN
      
      leads to a #GP when writing KERNEL_GS_BASE into hardware:
      
        unchecked MSR access error: WRMSR to 0xc0000102 (tried to write 0x0004000000000000)
        at rIP: 0xffffffffa00f239a (vmx_prepare_switch_to_guest+0x10a/0x1d0 [kvm_intel])
        Call Trace:
         kvm_arch_vcpu_ioctl_run+0x671/0x1c70 [kvm]
         kvm_vcpu_ioctl+0x36b/0x5d0 [kvm]
         do_vfs_ioctl+0xa1/0x620
         ksys_ioctl+0x66/0x70
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x4c/0x170
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7fc08133bf47
      
      Note, the above sequence fails VM-Enter due to invalid guest state.
      Userspace can allow VM-Enter to succeed (after the WRMSR #GP) by adding
      a KVM_SET_SREGS w/ CR4.LA57=0 after KVM_SET_MSRS, in which case KVM will
      technically leak the host's KERNEL_GS_BASE into the guest.  But, as
      KERNEL_GS_BASE is a userspace-defined value/address, the leak is largely
      benign as a malicious userspace would simply be exposing its own data to
      the guest, and attacking a benevolent userspace would require multiple
      bugs in the userspace VMM.
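      The shape of the change: build the reserved-bit mask at runtime from what the host actually supports instead of a compile-time constant. A minimal sketch; host feature probing is stubbed out with booleans (the kernel derives this from its CPU-capability queries), and only two feature bits are shown, at their architectural CR4 positions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_CR4_LA57	(1ULL << 12)	/* 57-bit linear addresses */
#define X86_CR4_PKE	(1ULL << 22)	/* protection keys for user pages */

/* Any feature bit the host CPU lacks is treated as reserved, regardless
 * of what CPUID userspace assigned to the vCPU. */
static uint64_t cr4_host_reserved_bits(bool host_has_la57, bool host_has_pke)
{
	uint64_t reserved = 0;	/* the static CR4_RESERVED_BITS would seed this */

	if (!host_has_la57)
		reserved |= X86_CR4_LA57;
	if (!host_has_pke)
		reserved |= X86_CR4_PKE;
	return reserved;
}

/* The kvm_set_cr4()-style check: reject any cr4 with a reserved bit set. */
static bool cr4_valid(uint64_t cr4, uint64_t reserved)
{
	return (cr4 & reserved) == 0;
}
```

      With this mask, the KVM_SET_SREGS call in the sequence above would fail on a host without LA57 instead of reaching the unchecked WRMSR.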
      
      Cc: stable@vger.kernel.org
      Cc: Jun Nakajima <jun.nakajima@intel.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: VMX: Add non-canonical check on writes to RTIT address MSRs · 684ea466
      Sean Christopherson authored
      commit fe6ed369 upstream.
      
      Reject writes to RTIT address MSRs if the data being written is a
      non-canonical address as the MSRs are subject to canonical checks, e.g.
      KVM will trigger an unchecked #GP when loading the values to hardware
      during pt_guest_enter().
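      The check being added boils down to sign-extending the address from the CPU's virtual-address width and comparing the result with the original value. A standalone sketch, assuming 48 virtual-address bits and illustrative helper names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sign-extend la from vaddr_bits to 64 bits. */
static uint64_t get_canonical(uint64_t la, unsigned int vaddr_bits)
{
	return (uint64_t)(((int64_t)(la << (64 - vaddr_bits))) >> (64 - vaddr_bits));
}

/* An address is non-canonical when bits [63:vaddr_bits-1] are not a pure
 * sign extension of bit (vaddr_bits - 1); loading such a value into an
 * address MSR faults with #GP. */
static bool is_noncanonical_address(uint64_t la, unsigned int vaddr_bits)
{
	return get_canonical(la, vaddr_bits) != la;
}
```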
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM · f9fb42b4
      Sean Christopherson authored
      commit 736c291c upstream.
      
      Convert a plethora of parameters and variables in the MMU and page fault
      flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.
      
      Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
      addresses.  When TDP is enabled, the fault address is a guest physical
      address and thus can be a 64-bit value, even when both KVM and its guest
      are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
      64-bit field, not a natural width field.
      
      Using a gva_t for the fault address means KVM will incorrectly drop the
      upper 32-bits of the GPA.  Ditto for gva_to_gpa() when it is used to
      translate L2 GPAs to L1 GPAs.
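      The truncation is easy to demonstrate in isolation. The typedefs below mirror a 32-bit build; they are illustrative, not the kernel's headers:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t gva_t;	/* 32 bits on a 32-bit build */
typedef uint64_t gpa_t;	/* physical addresses can exceed 32 bits with PAE */

/* Pre-patch shape: the fault address passes through a gva_t slot, silently
 * dropping the upper 32 bits of a TDP guest physical address. */
static gpa_t fault_address_via_gva(gpa_t cr2_or_gpa)
{
	gva_t truncated = (gva_t)cr2_or_gpa;	/* the bug: high bits lost */

	return truncated;
}
```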
      
      Opportunistically rename variables and parameters to better reflect the
      dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
      "addr" instead of "vaddr" when the address may be either a GVA or an L2
      GPA.  Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
      a confusing "gpa_t gva" declaration; this also sets the stage for a
      future patch to combine nonpaging_page_fault() and tdp_page_fault() with
      minimal churn.
      
      Sprinkle in a few comments to document flows where an address is known
      to be a GVA and thus can be safely truncated to a 32-bit value.  Add
      WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
      document such cases and detect bugs.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/KVM: Clean up host's steal time structure · 43ba65a5
      Boris Ostrovsky authored
      commit a6bd811f upstream.
      
      Now that we are mapping kvm_steal_time from the guest directly we
      don't need to keep a copy of it in kvm_vcpu_arch.st. The same is true
      for the stime field.
      
      This is part of CVE-2019-3016.
      
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/kvm: Cache gfn to pfn translation · 07269066
      Boris Ostrovsky authored
      commit 91724814 upstream.
      
      __kvm_map_gfn()'s call to gfn_to_pfn_memslot() is
      * relatively expensive
      * in certain cases (such as when done from atomic context) not allowed at all
      
      Stashing gfn-to-pfn mapping should help with both cases.
      
      This is part of CVE-2019-3016.
      
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed · 91fefc3e
      Boris Ostrovsky authored
      commit b0431382 upstream.
      
      There is a potential race in record_steal_time() between setting
      host-local vcpu->arch.st.steal.preempted to zero (i.e. clearing
      KVM_VCPU_PREEMPTED) and propagating this value to the guest with
      kvm_write_guest_cached(). Between those two events the guest may
      still see KVM_VCPU_PREEMPTED in its copy of kvm_steal_time, set
      KVM_VCPU_FLUSH_TLB and assume that hypervisor will do the right
      thing. Which it won't.
      
      Instead of copying, we should map kvm_steal_time, and that will
      guarantee atomicity of accesses to @preempted.
      
      This is part of CVE-2019-3016.
      
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/kvm: Introduce kvm_(un)map_gfn() · f6ad4449
      Boris Ostrovsky authored
      commit 1eff70a9 upstream.
      
      kvm_vcpu_(un)map operates on gfns from any current address space.
      In certain cases we want to make sure we are not mapping SMRAM
      and for that we can use kvm_(un)map_gfn() that we are introducing
      in this patch.
      
      This is part of CVE-2019-3016.
      
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86: use CPUID to locate host page table reserved bits · c6a896bb
      Paolo Bonzini authored
      commit 7adacf5e upstream.
      
      The comment in kvm_get_shadow_phys_bits refers to MKTME, but the same is actually
      true of SME and SEV.  Just use CPUID[0x8000_0008].EAX[7:0] unconditionally
      when available; it is the simplest approach and works even if memory is not encrypted.
      
      Cc: stable@vger.kernel.org
      Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/kvm: Be careful not to clear KVM_VCPU_FLUSH_TLB bit · 06b34107
      Boris Ostrovsky authored
      commit 8c6de56a upstream.
      
      kvm_steal_time_set_preempted() may accidentally clear KVM_VCPU_FLUSH_TLB
      bit if it is called more than once while VCPU is preempted.
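      A minimal model of the clobber and the guard. The flag values and the plain byte standing in for the shared steal-time field are illustrative; the real code writes through the cached guest mapping:

```c
#include <assert.h>
#include <stdint.h>

#define KVM_VCPU_PREEMPTED	(1 << 0)
#define KVM_VCPU_FLUSH_TLB	(1 << 1)

/* Pre-patch shape: a second call while the vCPU is already preempted
 * rewrites the whole byte and wipes a KVM_VCPU_FLUSH_TLB request the
 * guest set in between. */
static void set_preempted_buggy(uint8_t *st_preempted)
{
	*st_preempted = KVM_VCPU_PREEMPTED;
}

/* Patched shape: bail out if the vCPU is already marked preempted, so
 * flags set by the guest in the meantime survive. */
static void set_preempted_fixed(uint8_t *st_preempted)
{
	if (*st_preempted & KVM_VCPU_PREEMPTED)
		return;
	*st_preempted = KVM_VCPU_PREEMPTED;
}
```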
      
      This is part of CVE-2019-3016.
      
      (This bug was also independently discovered by Jim Mattson
      <jmattson@google.com>)
      
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86/mmu: Apply max PA check for MMIO sptes to 32-bit KVM · 0972f0da
      Sean Christopherson authored
      commit e30a7d62 upstream.
      
      Remove the bogus 64-bit only condition from the check that disables MMIO
      spte optimization when the system supports the max PA, i.e. doesn't have
      any reserved PA bits.  32-bit KVM always uses PAE paging for the shadow
      MMU, and per Intel's SDM:
      
        PAE paging translates 32-bit linear addresses to 52-bit physical
        addresses.
      
      The kernel's restrictions on max physical addresses are limits on how
      much memory the kernel can reasonably use, not what physical addresses
      are supported by hardware.
      
      Fixes: ce88decf ("KVM: MMU: mmio page fault support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kvm/svm: PKU not currently supported · b4daeb00
      John Allen authored
      commit a47970ed upstream.
      
      The current SVM implementation does not support handling PKU. Guests
      running on a host with future AMD CPUs that support the feature will read
      garbage from the PKRU register and will hit segmentation faults on boot,
      because memory that should not be protected gets marked as protected.
      Ensure that the cpuid reported by SVM does not advertise the feature.
      
      Signed-off-by: John Allen <john.allen@amd.com>
      Cc: stable@vger.kernel.org
      Fixes: 0556cbdc ("x86/pkeys: Don't check if PKRU is zero before writing it")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: PPC: Book3S PR: Free shared page if mmu initialization fails · 18eebcc7
      Sean Christopherson authored
      commit cb10bf91 upstream.
      
      Explicitly free the shared page if kvmppc_mmu_init() fails during
      kvmppc_core_vcpu_create(), as the page is freed only in
      kvmppc_core_vcpu_free(), which is not reached via kvm_vcpu_uninit().
      
      Fixes: 96bc451a ("KVM: PPC: Introduce shared page")
      Cc: stable@vger.kernel.org
      Reviewed-by: Greg Kurz <groug@kaod.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: PPC: Book3S HV: Uninit vCPU if vcore creation fails · 89f61eb6
      Sean Christopherson authored
      commit 1a978d9d upstream.
      
      Call kvm_vcpu_uninit() if vcore creation fails to avoid leaking any
      resources allocated by kvm_vcpu_init(), i.e. the vcpu->run page.
      
      Fixes: 371fefd6 ("KVM: PPC: Allow book3s_hv guests to use SMT processor modes")
      Cc: stable@vger.kernel.org
      Reviewed-by: Greg Kurz <groug@kaod.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86: Fix potential put_fpu() w/o load_fpu() on MPX platform · fb42ab92
      Sean Christopherson authored
      commit f958bd23 upstream.
      
      Unlike most state managed by XSAVE, MPX is initialized to zero on INIT.
      Because INITs are usually recognized in the context of a VCPU_RUN call,
      kvm_vcpu_reset() puts the guest's FPU so that the FPU state is resident
      in memory, zeros the MPX state, and reloads FPU state to hardware.  But,
      in the unlikely event that an INIT is recognized during
      kvm_arch_vcpu_ioctl_get_mpstate() via kvm_apic_accept_events(),
      kvm_vcpu_reset() will call kvm_put_guest_fpu() without a preceding
      kvm_load_guest_fpu() and corrupt the guest's FPU state (and possibly
      userspace's FPU state as well).
      
      Given that MPX is being removed from the kernel[*], fix the bug with the
      simple-but-ugly approach of loading the guest's FPU during
      KVM_GET_MP_STATE.
      
      [*] See commit f240652b ("x86/mpx: Remove MPX APIs").
      
      Fixes: f775b13e ("x86,kvm: move qemu/guest FPU switching out to vcpu_run")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86: Protect MSR-based index computations in fixed_msr_to_seg_unit() from Spectre-v1/L1TF attacks · 638e8e13
      Marios Pomonis authored
      
      commit 25a5edea upstream.
      
      This fixes a Spectre-v1/L1TF vulnerability in fixed_msr_to_seg_unit().
      This function contains index computations based on the
      (attacker-controlled) MSR number.
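      The hardening applied throughout this series follows one pattern: clamp the attacker-controlled index with a branchless mask before using it, so a mispredicted bounds check cannot speculatively read out of bounds. A minimal user-space sketch modeled on the kernel's generic array_index_nospec() helpers; the table-read function is a hypothetical stand-in for the patched MSR/IOREGSEL index computations:

```c
#include <assert.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(long) * 8)

/* All-ones when index < size, all-zeroes otherwise, computed without a
 * branch (modeled on the kernel's generic array_index_mask_nospec()). */
static unsigned long array_index_mask_nospec(unsigned long index,
					     unsigned long size)
{
	return ~(long)(index | (size - 1UL - index)) >> (BITS_PER_LONG - 1);
}

/* Hypothetical bounds-checked table read: even if the CPU speculates past
 * the branch, the mask forces the index to zero. */
static unsigned int table_read(const unsigned int *table, size_t size,
			       size_t index)
{
	if (index >= size)
		return 0;
	index &= array_index_mask_nospec(index, size);
	return table[index];
}
```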
      
      Fixes: de9aef5e ("KVM: MTRR: introduce fixed_mtrr_segment table")
      Signed-off-by: Nick Finco <nifi@google.com>
      Signed-off-by: Marios Pomonis <pomonis@google.com>
      Reviewed-by: Andrew Honig <ahonig@google.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
    • KVM: x86: Protect x86_decode_insn from Spectre-v1/L1TF attacks · bb9a9b51
      Marios Pomonis authored
      commit 3c9053a2 upstream.
      
      This fixes a Spectre-v1/L1TF vulnerability in x86_decode_insn().
      kvm_emulate_instruction() (an ancestor of x86_decode_insn()) is an exported
      symbol, so KVM should treat it conservatively from a security perspective.
      
      Fixes: 045a282c ("KVM: emulator: implement fninit, fnstsw, fnstcw")
      Signed-off-by: Nick Finco <nifi@google.com>
      Signed-off-by: Marios Pomonis <pomonis@google.com>
      Reviewed-by: Andrew Honig <ahonig@google.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86: Protect MSR-based index computations from Spectre-v1/L1TF attacks in x86.c · 63d5ec6e
      Marios Pomonis authored
      commit 6ec4c5ee upstream.
      
      This fixes a Spectre-v1/L1TF vulnerability in set_msr_mce() and
      get_msr_mce().
      Both functions contain index computations based on the
      (attacker-controlled) MSR number.
      
      Fixes: 890ca9ae ("KVM: Add MCE support")
      Signed-off-by: Nick Finco <nifi@google.com>
      Signed-off-by: Marios Pomonis <pomonis@google.com>
      Reviewed-by: Andrew Honig <ahonig@google.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • KVM: x86: Protect ioapic_read_indirect() from Spectre-v1/L1TF attacks · 3a9e64af
      Marios Pomonis authored
      commit 8c86405f upstream.
      
      This fixes a Spectre-v1/L1TF vulnerability in ioapic_read_indirect().
      This function contains index computations based on the
      (attacker-controlled) IOREGSEL register.
      
      Fixes: a2c118bf ("KVM: Fix bounds checking in ioapic indirect register reads (CVE-2013-1798)")
      Signed-off-by: Nick Finco <nifi@google.com>
      Signed-off-by: Marios Pomonis <pomonis@google.com>
      Reviewed-by: Andrew Honig <ahonig@google.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>