Skip to content
  1. Jul 19, 2018
  2. Jul 18, 2018
    • Peng Hao's avatar
      kvmclock: fix TSC calibration for nested guests · e10f7805
      Peng Hao authored
      
      
      Inside a nested guest, access to hardware can be slow enough that
      tsc_read_refs always return ULLONG_MAX, causing tsc_refine_calibration_work
      to be called periodically and the nested guest to spend a lot of time
      reading the ACPI timer.
      
      However, if the TSC frequency is available from the pvclock page,
      we can just set X86_FEATURE_TSC_KNOWN_FREQ and avoid the recalibration.
      'refine' operation.
      
      Suggested-by: default avatarPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: default avatarPeng Hao <peng.hao2@zte.com.cn>
      [Commit message rewritten. - Paolo]
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      e10f7805
    • Liran Alon's avatar
      KVM: VMX: Mark VMXArea with revision_id of physical CPU even when eVMCS enabled · 2307af1c
      Liran Alon authored
      
      
      When eVMCS is enabled, all VMCS allocated to be used by KVM are marked
      with revision_id of KVM_EVMCS_VERSION instead of revision_id reported
      by MSR_IA32_VMX_BASIC.
      
      However, even though not explictly documented by TLFS, VMXArea passed
      as VMXON argument should still be marked with revision_id reported by
      physical CPU.
      
      This issue was found by the following setup:
      * L0 = KVM which expose eVMCS to it's L1 guest.
      * L1 = KVM which consume eVMCS reported by L0.
      This setup caused the following to occur:
      1) L1 execute hardware_enable().
      2) hardware_enable() calls kvm_cpu_vmxon() to execute VMXON.
      3) L0 intercept L1 VMXON and execute handle_vmon() which notes
      vmxarea->revision_id != VMCS12_REVISION and therefore fails with
      nested_vmx_failInvalid() which sets RFLAGS.CF.
      4) L1 kvm_cpu_vmxon() don't check RFLAGS.CF for failure and therefore
      hardware_enable() continues as usual.
      5) L1 hardware_enable() then calls ept_sync_global() which executes
      INVEPT.
      6) L0 intercept INVEPT and execute handle_invept() which notes
      !vmx->nested.vmxon and thus raise a #UD to L1.
      7) Raised #UD caused L1 to panic.
      
      Reviewed-by: default avatarKrish Sadhukhan <krish.sadhukhan@oracle.com>
      Cc: stable@vger.kernel.org
      Fixes: 773e8a04
      
      
      Signed-off-by: default avatarLiran Alon <liran.alon@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      2307af1c
    • Paolo Bonzini's avatar
      KVM: irqfd: fix race between EPOLLHUP and irq_bypass_register_consumer · 9432a317
      Paolo Bonzini authored
      
      
      A comment warning against this bug is there, but the code is not doing what
      the comment says.  Therefore it is possible that an EPOLLHUP races against
      irq_bypass_register_consumer.  The EPOLLHUP handler schedules irqfd_shutdown,
      and if that runs soon enough, you get a use-after-free.
      
      Reported-by: default avatarsyzbot <syzkaller@googlegroups.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      9432a317
    • Lan Tianyu's avatar
      KVM/Eventfd: Avoid crash when assign and deassign specific eventfd in parallel. · b5020a8e
      Lan Tianyu authored
      
      
      Syzbot reports crashes in kvm_irqfd_assign(), caused by use-after-free
      when kvm_irqfd_assign() and kvm_irqfd_deassign() run in parallel
      for one specific eventfd. When the assign path hasn't finished but irqfd
      has been added to kvm->irqfds.items list, another thead may deassign the
      eventfd and free struct kvm_kernel_irqfd(). The assign path then uses
      the struct kvm_kernel_irqfd that has been freed by deassign path. To avoid
      such issue, keep irqfd under kvm->irq_srcu protection after the irqfd
      has been added to kvm->irqfds.items list, and call synchronize_srcu()
      in irq_shutdown() to make sure that irqfd has been fully initialized in
      the assign path.
      
      Reported-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarTianyu Lan <tianyu.lan@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      b5020a8e
    • Linus Torvalds's avatar
      Mark HI and TASKLET softirq synchronous · 3c53776e
      Linus Torvalds authored
      Way back in 4.9, we committed 4cd13c21 ("softirq: Let ksoftirqd do
      its job"), and ever since we've had small nagging issues with it.  For
      example, we've had:
      
        1ff68820 ("watchdog: core: make sure the watchdog_worker is not deferred")
        8d5755b3 ("watchdog: softdog: fire watchdog even if softirqs do not get to run")
        217f6974
      
       ("net: busy-poll: allow preemption in sk_busy_loop()")
      
      all of which worked around some of the effects of that commit.
      
      The DVB people have also complained that the commit causes excessive USB
      URB latencies, which seems to be due to the USB code using tasklets to
      schedule USB traffic.  This seems to be an issue mainly when already
      living on the edge, but waiting for ksoftirqd to handle it really does
      seem to cause excessive latencies.
      
      Now Hanna Hawa reports that this issue isn't just limited to USB URB and
      DVB, but also causes timeout problems for the Marvell SoC team:
      
       "I'm facing kernel panic issue while running raid 5 on sata disks
        connected to Macchiatobin (Marvell community board with Armada-8040
        SoC with 4 ARMv8 cores of CA72) Raid 5 built with Marvell DMA engine
        and async_tx mechanism (ASYNC_TX_DMA [=y]); the DMA driver (mv_xor_v2)
        uses a tasklet to clean the done descriptors from the queue"
      
      The latency problem causes a panic:
      
        mv_xor_v2 f0400000.xor: dma_sync_wait: timeout!
        Kernel panic - not syncing: async_tx_quiesce: DMA error waiting for transaction
      
      We've discussed simply just reverting the original commit entirely, and
      also much more involved solutions (with per-softirq threads etc).  This
      patch is intentionally stupid and fairly limited, because the issue
      still remains, and the other solutions either got sidetracked or had
      other issues.
      
      We should probably also consider the timer softirqs to be synchronous
      and not be delayed to ksoftirqd (since they were the issue with the
      earlier watchdog problems), but that should be done as a separate patch.
      This does only the tasklet cases.
      
      Reported-and-tested-by: default avatarHanna Hawa <hannah@marvell.com>
      Reported-and-tested-by: default avatarJosef Griebichler <griebichler.josef@gmx.at>
      Reported-by: default avatarMauro Carvalho Chehab <mchehab@s-opensource.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3c53776e
  3. Jul 17, 2018
    • Qu Wenruo's avatar
      btrfs: scrub: Don't use inode page cache in scrub_handle_errored_block() · 665d4953
      Qu Wenruo authored
      In commit ac0b4145 ("btrfs: scrub: Don't use inode pages for device
      replace") we removed the branch of copy_nocow_pages() to avoid
      corruption for compressed nodatasum extents.
      
      However above commit only solves the problem in scrub_extent(), if
      during scrub_pages() we failed to read some pages,
      sctx->no_io_error_seen will be non-zero and we go to fixup function
      scrub_handle_errored_block().
      
      In scrub_handle_errored_block(), for sctx without csum (no matter if
      we're doing replace or scrub) we go to scrub_fixup_nodatasum() routine,
      which does the similar thing with copy_nocow_pages(), but does it
      without the extra check in copy_nocow_pages() routine.
      
      So for test cases like btrfs/100, where we emulate read errors during
      replace/scrub, we could corrupt compressed extent data again.
      
      This patch will fix it just by avoiding any "optimization" for
      nodatasum, just falls back to the normal fixup routine by try read from
      any good copy.
      
      This also solves WARN_ON() or dead lock caused by lame backref iteration
      in scrub_fixup_nodatasum() routine.
      
      The deadlock or WARN_ON() won't be triggered before commit ac0b4145
      ("btrfs: scrub: Don't use inode pages for device replace") since
      copy_nocow_pages() have better locking and extra check for data extent,
      and it's already doing the fixup work by try to read data from any good
      copy, so it won't go scrub_fixup_nodatasum() anyway.
      
      This patch disables the faulty code and will be removed completely in a
      followup patch.
      
      Fixes: ac0b4145
      
       ("btrfs: scrub: Don't use inode pages for device replace")
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      665d4953
    • Linus Torvalds's avatar
      Merge tag 'pinctrl-v4.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 30b06abf
      Linus Torvalds authored
      Pull pin control fixes from Linus Walleij:
      
       - A slew of driver fixes for Mediatek mt7622
      
       - Fix a direction inversion bug in the Ingenic driver
      
       - Fix unsupported drive strength setting on the PFC r8a77970
      
       - Off by one and NULL dereference fixes in the NSP driver
      
      * tag 'pinctrl-v4.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
        pinctrl: nsp: Fix potential NULL dereference
        pinctrl: nsp: off by ones in nsp_pinmux_enable()
        pinctrl: sh-pfc: r8a77970: remove SH_PFC_PIN_CFG_DRIVE_STRENGTH flag
        pinctrl: ingenic: Fix inverted direction for < JZ4770
        pinctrl: mt7622: fix a kernel panic when gpio-hog is being applied
        pinctrl: mt7622: stop using the deprecated pinctrl_add_gpio_range
        pinctrl: mt7622: fix that pinctrl_claim_hogs cannot work
        pinctrl: mt7622: fix initialization sequence between eint and gpiochip
        pinctrl: mt7622: fix error path on failing at groups building
      30b06abf
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-2018-07-16-1' of git://anongit.freedesktop.org/drm/drm · 706bf68b
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
      
       - two AGP fixes in here
      
       - a bunch of mostly amdgpu fixes
      
       - sun4i build fix
      
       - two armada fixes
      
       - some tegra fixes
      
       - one i915 core and one i915 gvt fix
      
      * tag 'drm-fixes-2018-07-16-1' of git://anongit.freedesktop.org/drm/drm:
        drm/amdgpu/pp/smu7: use a local variable for toc indexing
        amd/dc/dce100: On dce100, set clocks to 0 on suspend
        drm/amd/display: Convert 10kHz clks from PPLib into kHz for Vega
        drm/amdgpu: Verify root PD is mapped into kernel address space (v4)
        drm/amd/display: fix invalid function table override
        drm/amdgpu: Reserve VM root shared fence slot for command submission (v3)
        Revert "drm/amd/display: Don't return ddc result and read_bytes in same return value"
        char: amd64-agp: Use 64-bit arithmetic instead of 32-bit
        char: agp: Change return type to vm_fault_t
        drm/i915: Fix hotplug irq ack on i965/g4x
        drm/armada: fix irq handling
        drm/armada: fix colorkey mode property
        drm/tegra: Fix comparison operator for buffer size
        gpu: host1x: Check whether size of unpin isn't 0
        gpu: host1x: Skip IOMMU initialization if firewall is enabled
        drm/sun4i: link in front-end code if needed
        drm/i915/gvt: update vreg on inhibit context lri command
      706bf68b
    • Pavel Tatashin's avatar
      mm: don't do zero_resv_unavail if memmap is not allocated · d1b47a7c
      Pavel Tatashin authored
      Moving zero_resv_unavail before memmap_init_zone(), caused a regression on
      x86-32.
      
      The cause is that we access struct pages before they are allocated when
      CONFIG_FLAT_NODE_MEM_MAP is used.
      
      free_area_init_nodes()
        zero_resv_unavail()
          mm_zero_struct_page(pfn_to_page(pfn)); <- struct page is not alloced
        free_area_init_node()
          if CONFIG_FLAT_NODE_MEM_MAP
            alloc_node_mem_map()
              memblock_virt_alloc_node_nopanic() <- struct page alloced here
      
      On the other hand memblock_virt_alloc_node_nopanic() zeroes all the memory
      that it returns, so we do not need to do zero_resv_unavail() here.
      
      Fixes: e181ae0c
      
       ("mm: zero unavailable pages before memmap init")
      Signed-off-by: default avatarPavel Tatashin <pasha.tatashin@oracle.com>
      Tested-by: default avatarMatt Hart <matt@mattface.org>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d1b47a7c
  4. Jul 16, 2018
  5. Jul 15, 2018