  1. May 08, 2015
  2. May 07, 2015
    • KVM: x86: dump VMCS on invalid entry · 4eb64dce
      Paolo Bonzini authored
      
      
      Code and format roughly based on Xen's vmcs_dump_vcpu.
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4eb64dce
    • x86: kvmclock: drop rdtsc_barrier() · a3eb97bd
      Marcelo Tosatti authored
      Drop unnecessary rdtsc_barrier(), as has been determined empirically;
      see 057e6a8c for details.
      
      Noticed by Andy Lutomirski.
      
      Improves clock_gettime() by approximately 15% on
      Intel i7-3520M @ 2.90GHz.
      
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a3eb97bd
    • KVM: x86: drop unneeded null test · d90e3a35
      Julia Lawall authored
      
      
      If the null test is needed, the call to cancel_delayed_work_sync would have
      already crashed.  Normally, the destroy function should only be called
      if the init function has succeeded, in which case ioapic is not null.
      
      Problem found using Coccinelle.
      
      Suggested-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d90e3a35
    • KVM: x86: fix initial PAT value · 74545705
      Radim Krčmář authored
      
      
      PAT should be 0007_0406_0007_0406h on RESET and not modified on INIT.
      VMX used a wrong value (host's PAT) and while SVM used the right one,
      it never got to arch.pat.
      
      This is not an issue with QEMU as it will force the correct value.
      
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      74545705
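The reset value quoted in the commit above packs eight one-byte PAT entries into a 64-bit MSR. A minimal sketch of how that value decodes, assuming the memory-type encodings from the Intel SDM; the helper names `pat_entry` and `pat_type_name` are hypothetical and not part of KVM:

```c
#include <stdint.h>

/* Reset value of the PAT MSR as quoted in the commit message. */
#define PAT_RESET_VALUE 0x0007040600070406ULL

/* Return the byte holding PAT entry `index` (0..7); entry 0 is the lowest byte. */
static inline uint8_t pat_entry(uint64_t pat, unsigned int index)
{
    return (uint8_t)((pat >> (index * 8)) & 0xff);
}

/* Map a memory-type encoding to its architectural name. */
static inline const char *pat_type_name(uint8_t type)
{
    switch (type) {
    case 0x00: return "UC";
    case 0x01: return "WC";
    case 0x04: return "WT";
    case 0x05: return "WP";
    case 0x06: return "WB";
    case 0x07: return "UC-";
    default:   return "reserved";
    }
}
```

Read from entry 0 upward, the reset value yields WB, WT, UC-, UC, and then the same four types again for entries 4 to 7.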
    • kvm,x86: load guest FPU context more eagerly · 653f52c3
      Rik van Riel authored
      
      
      Currently KVM will clear the FPU bits in CR0.TS in the VMCS, and trap to
      re-load them every time the guest accesses the FPU after a switch back into
      the guest from the host.
      
      This patch copies the x86 task switch semantics for FPU loading, with the
      FPU loaded eagerly after first use if the system uses eager fpu mode,
      or if the guest uses the FPU frequently.
      
      In the latter case, after loading the FPU for 255 times, the fpu_counter
      will roll over, and we will revert to loading the FPU on demand, until
      it has been established that the guest is still actively using the FPU.
      
      This mirrors the x86 task switch policy, which seems to work.
      
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      653f52c3
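The rollover policy described in the commit above can be sketched in user space as follows. This is not the KVM code: `struct vcpu_sketch`, `fpu_keep_loaded` and the threshold of 5 are assumptions made for illustration. Only the 8-bit counter that wraps after 255 loads comes from the commit text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold: loads needed before the guest counts as a frequent FPU user. */
#define FPU_EAGER_THRESHOLD 5

struct vcpu_sketch {
    uint8_t fpu_counter; /* 8-bit, so it wraps to 0 after reaching 255 */
    bool eager_fpu;      /* system-wide eager FPU mode */
};

/* Record one FPU load; return true if the FPU should stay loaded on the next entry. */
static inline bool fpu_keep_loaded(struct vcpu_sketch *v)
{
    if (v->eager_fpu)
        return true;
    v->fpu_counter++; /* unsigned wraparound: 255 + 1 becomes 0 */
    return v->fpu_counter >= FPU_EAGER_THRESHOLD;
}
```

Once the counter wraps to 0, the sketch falls back to on-demand loading until the guest has used the FPU often enough again, which is the behaviour the commit message describes.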
    • kvm: x86: Deliver MSI IRQ to only lowest prio cpu if msi_redir_hint is true · d1ebdbf9
      James Sullivan authored
      
      
      An MSI interrupt should only be delivered to the lowest priority CPU
      when it has RH=1, regardless of the delivery mode. Modified
      kvm_is_dm_lowest_prio() to check for either irq->delivery_mode == APIC_DM_LOWPRI
      or irq->msi_redir_hint.
      
      Moved kvm_is_dm_lowest_prio() into lapic.h and renamed to
      kvm_lowest_prio_delivery().
      
      Changed a check in kvm_irq_delivery_to_apic_fast() from
      irq->delivery_mode == APIC_DM_LOWPRI to kvm_is_dm_lowest_prio().
      
      Signed-off-by: James Sullivan <sullivan.james.f@gmail.com>
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d1ebdbf9
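The predicate described in the commit above reduces to a single boolean expression. A minimal sketch, assuming a cut-down IRQ structure; `struct lapic_irq_sketch` and `lowest_prio_delivery` are hypothetical stand-ins, and the numeric value given to `APIC_DM_LOWPRI` is illustrative only:

```c
#include <stdbool.h>
#include <stdint.h>

#define APIC_DM_FIXED  0x000u
#define APIC_DM_LOWPRI 0x100u /* name taken from the commit text; value is illustrative */

/* Cut-down stand-in for struct kvm_lapic_irq. */
struct lapic_irq_sketch {
    uint32_t delivery_mode;
    bool msi_redir_hint;
};

/* Lowest-priority delivery applies if the mode requests it or the MSI had RH=1. */
static inline bool lowest_prio_delivery(const struct lapic_irq_sketch *irq)
{
    return irq->delivery_mode == APIC_DM_LOWPRI || irq->msi_redir_hint;
}
```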
    • kvm: x86: Extended struct kvm_lapic_irq with msi_redir_hint for MSI delivery · 93bbf0b8
      James Sullivan authored
      
      
      Extended struct kvm_lapic_irq with bool msi_redir_hint, which will
      be used to determine if the delivery of the MSI should target only
      the lowest priority CPU in the logical group specified for delivery.
      (In physical dest mode, the RH bit is not relevant). Initialized the value
      of msi_redir_hint to true when RH=1 in kvm_set_msi_irq(), and initialized
      to false in all other cases.
      
      Added value of msi_redir_hint to a debug message dump of an IRQ in
      apic_send_ipi().
      
      Signed-off-by: James Sullivan <sullivan.james.f@gmail.com>
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      93bbf0b8
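As a rough illustration of where the RH bit mentioned in the commit above comes from: in the x86 MSI address format, bit 3 is the redirection hint and bit 2 the destination mode. The sketch below extracts the hint from the low address word; the macro and function names are hypothetical and do not reproduce kvm_set_msi_irq().

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in the low 32 bits of an x86 MSI address. */
#define MSI_ADDR_RH_BIT (1u << 3) /* redirection hint */
#define MSI_ADDR_DM_BIT (1u << 2) /* destination mode: 1 = logical, 0 = physical */

/* True only when the MSI address carries RH=1; false in all other cases. */
static inline bool msi_redir_hint_from_addr(uint32_t address_lo)
{
    return (address_lo & MSI_ADDR_RH_BIT) != 0;
}
```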
    • KVM: x86: tweak types of fields in kvm_lapic_irq · b7cb2231
      Paolo Bonzini authored
      
      
      Change to u16 if they only contain data in the low 16 bits.
      
      Change the level field to bool, since we assign 1 sometimes, but
      just mask icr_low with APIC_INT_ASSERT in apic_send_ipi.
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b7cb2231
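The point about the level field in the commit above can be seen in a small sketch: storing a masked value in a wide integer keeps the raw bit, while storing it in a bool collapses it to 0 or 1, which matches the places that assign 1 directly. The function names are hypothetical, and the value of `APIC_INT_ASSERT` is assumed to match the kernel's APIC header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Level-assert bit of the low ICR word (value assumed for this sketch). */
#define APIC_INT_ASSERT 0x04000u

/* With a wide integer field, the masked value is stored unchanged. */
static inline uint32_t level_as_u32(uint32_t icr_low)
{
    return icr_low & APIC_INT_ASSERT;
}

/* With a bool field, the conversion turns any nonzero value into true. */
static inline bool level_as_bool(uint32_t icr_low)
{
    return icr_low & APIC_INT_ASSERT;
}
```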
    • KVM: x86: INIT and reset sequences are different · d28bc9dd
      Nadav Amit authored
      
      
      The x86 architecture defines differences between the reset and INIT sequences.
      INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, PMU,
      MSRs (in general), MTRRs, machine-check, APIC ID, APIC arbitration ID and BSP.
      
      References (from Intel SDM):
      
      "If the MP protocol has completed and a BSP is chosen, subsequent INITs (either
      to a specific processor or system wide) do not cause the MP protocol to be
      repeated." [8.4.2: MP Initialization Protocol Requirements and Restrictions]
      
      [Table 9-1. IA-32 Processor States Following Power-up, Reset, or INIT]
      
      "If the processor is reset by asserting the INIT# pin, the x87 FPU state is not
      changed." [9.2: X87 FPU INITIALIZATION]
      
      "The state of the local APIC following an INIT reset is the same as it is after
      a power-up or hardware reset, except that the APIC ID and arbitration ID
      registers are not affected." [10.4.7.3: Local APIC State After an INIT Reset
      ("Wait-for-SIPI" State)]
      
      Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
      Message-Id: <1428924848-28212-1-git-send-email-namit@cs.technion.ac.il>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d28bc9dd
    • KVM: x86: Support for disabling quirks · 90de4a18
      Nadav Amit authored
      
      
      Introducing KVM_CAP_DISABLE_QUIRKS for disabling x86 quirks that were previously
      created in order to overcome QEMU issues. Those issues were mostly the result of
      an invalid VM BIOS.  Currently there are two quirks that can be disabled:
      
      1. KVM_QUIRK_LINT0_REENABLED - LINT0 was enabled after boot
      2. KVM_QUIRK_CD_NW_CLEARED - CD and NW are cleared after boot
      
      These two issues are already resolved in recent releases of QEMU, which
      would therefore disable the quirks.
      
      Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
      Message-Id: <1428879221-29996-1-git-send-email-namit@cs.technion.ac.il>
      [Report capability from KVM_CHECK_EXTENSION too. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      90de4a18
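One way to picture the two quirk flags from the commit above is as bits in a per-VM mask that userspace sets to turn a quirk off. The sketch below is an assumption-laden illustration: the bit values, `struct vm_quirks_sketch`, `disable_quirks` and `quirk_active` are all hypothetical and do not show the real capability interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values are assumptions made for this sketch. */
#define KVM_QUIRK_LINT0_REENABLED (1u << 0)
#define KVM_QUIRK_CD_NW_CLEARED   (1u << 1)

/* Hypothetical per-VM state: one bit set for each quirk userspace turned off. */
struct vm_quirks_sketch {
    uint32_t disabled_quirks;
};

/* Turn off every quirk named in `mask`. */
static inline void disable_quirks(struct vm_quirks_sketch *vm, uint32_t mask)
{
    vm->disabled_quirks |= mask;
}

/* A quirk stays active until userspace disables it. */
static inline bool quirk_active(const struct vm_quirks_sketch *vm, uint32_t quirk)
{
    return !(vm->disabled_quirks & quirk);
}
```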
    • KVM: booke: use __kvm_guest_exit · e233d54d
      Paolo Bonzini authored
      
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e233d54d
    • KVM: arm/mips/x86/power use __kvm_guest_{enter|exit} · ccf73aaf
      Christian Borntraeger authored
      
      
      Use __kvm_guest_{enter|exit} instead of kvm_guest_{enter|exit}
      where interrupts are disabled.
      
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ccf73aaf
    • KVM: provide irq_unsafe kvm_guest_{enter|exit} · 0097d12e
      Christian Borntraeger authored
      
      
      Several kvm architectures disable interrupts before kvm_guest_enter.
      kvm_guest_enter then uses local_irq_save/restore to disable interrupts
      again or for the first time. Let's provide underscore versions of
      kvm_guest_{enter|exit} that assume being called with interrupts disabled.
      kvm_guest_enter now disables interrupts for the full function and
      thus we can remove the check for preemptible.
      
      This patch then adapts s390/kvm to use local_irq_disable/enable calls,
      which are slightly cheaper than local_irq_save/restore, and to call these
      new functions.
      
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0097d12e
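The split between a wrapper that manages interrupt state and an inner function that assumes interrupts are already off can be modelled in user space. In this sketch a plain flag stands in for the CPU interrupt state, and every name is hypothetical; it only mirrors the calling convention the commit above describes.

```c
#include <stdbool.h>

/* Stand-in for the CPU interrupt-enable state; real code uses local_irq_* helpers. */
static bool irqs_enabled = true;
static int guest_enter_count;

/* Inner variant: the caller guarantees interrupts are already disabled. */
static inline void guest_enter_irqs_off(void)
{
    guest_enter_count++;
}

/* Outer variant: saves the state, disables interrupts, then restores the state. */
static inline void guest_enter_sketch(void)
{
    bool saved = irqs_enabled; /* analogue of local_irq_save */
    irqs_enabled = false;
    guest_enter_irqs_off();
    irqs_enabled = saved;      /* analogue of local_irq_restore */
}
```

A caller that has already disabled interrupts calls the inner variant directly and skips the save and restore step.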
    • kvmclock: set scheduler clock stable · ff7bbb9c
      Luiz Capitulino authored
      
      
      If you try to enable NOHZ_FULL on a guest today, you'll get
      the following error when the guest tries to deactivate the
      scheduler tick:
      
       WARNING: CPU: 3 PID: 2182 at kernel/time/tick-sched.c:192 can_stop_full_tick+0xb9/0x290()
       NO_HZ FULL will not work with unstable sched clock
       CPU: 3 PID: 2182 Comm: kworker/3:1 Not tainted 4.0.0-10545-gb9bb6fb #204
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
       Workqueue: events flush_to_ldisc
        ffffffff8162a0c7 ffff88011f583e88 ffffffff814e6ba0 0000000000000002
        ffff88011f583ed8 ffff88011f583ec8 ffffffff8104d095 ffff88011f583eb8
        0000000000000000 0000000000000003 0000000000000001 0000000000000001
       Call Trace:
        <IRQ>  [<ffffffff814e6ba0>] dump_stack+0x4f/0x7b
        [<ffffffff8104d095>] warn_slowpath_common+0x85/0xc0
        [<ffffffff8104d146>] warn_slowpath_fmt+0x46/0x50
        [<ffffffff810bd2a9>] can_stop_full_tick+0xb9/0x290
        [<ffffffff810bd9ed>] tick_nohz_irq_exit+0x8d/0xb0
        [<ffffffff810511c5>] irq_exit+0xc5/0x130
        [<ffffffff814f180a>] smp_apic_timer_interrupt+0x4a/0x60
        [<ffffffff814eff5e>] apic_timer_interrupt+0x6e/0x80
        <EOI>  [<ffffffff814ee5d1>] ? _raw_spin_unlock_irqrestore+0x31/0x60
        [<ffffffff8108bbc8>] __wake_up+0x48/0x60
        [<ffffffff8134836c>] n_tty_receive_buf_common+0x49c/0xba0
        [<ffffffff8134a6bf>] ? tty_ldisc_ref+0x1f/0x70
        [<ffffffff81348a84>] n_tty_receive_buf2+0x14/0x20
        [<ffffffff8134b390>] flush_to_ldisc+0xe0/0x120
        [<ffffffff81064d05>] process_one_work+0x1d5/0x540
        [<ffffffff81064c81>] ? process_one_work+0x151/0x540
        [<ffffffff81065191>] worker_thread+0x121/0x470
        [<ffffffff81065070>] ? process_one_work+0x540/0x540
        [<ffffffff8106b4df>] kthread+0xef/0x110
        [<ffffffff8106b3f0>] ? __kthread_parkme+0xa0/0xa0
        [<ffffffff814ef4f2>] ret_from_fork+0x42/0x70
        [<ffffffff8106b3f0>] ? __kthread_parkme+0xa0/0xa0
       ---[ end trace 06e3507544a38866 ]---
      
      However, it turns out that kvmclock does provide a stable
      sched_clock callback. So, let the scheduler know this which
      in turn makes NOHZ_FULL work in the guest.
      
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ff7bbb9c
  3. May 04, 2015
  4. May 03, 2015
    • ext4: fix growing of tiny filesystems · 2c869b26
      Jan Kara authored
      
      
      The estimate of necessary transaction credits in ext4_flex_group_add()
      is too pessimistic. It reserves credit for sb, resize inode, and resize
      inode dindirect block for each group added in a flex group although they
      are always the same block and thus it is enough to account them only
      once. Also the number of modified GDT blocks is overestimated since we
      fit EXT4_DESC_PER_BLOCK(sb) descriptors in one block.
      
      Make the estimation more precise. That reduces the number of requested
      credits enough that we can grow a 20 MB filesystem (which has a 1 MB
      journal, 79 reserved GDT blocks, and flex group size 16 by default).
      
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      2c869b26
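The difference between per-group and shared accounting described in the commit above can be worked through with small numbers. The two formulas below are an illustrative reading of the commit text, not the code in ext4_flex_group_add(); the function names and the descriptors-per-block figure of 32 used in the note are assumptions.

```c
/* Round-up integer division. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Pessimistic estimate: sb, resize inode, dindirect block and one GDT block per group. */
static inline unsigned int credits_per_group(unsigned int groups,
                                             unsigned int reserved_gdb)
{
    return groups * 4 + reserved_gdb;
}

/* Tighter estimate: shared blocks counted once, GDT blocks by descriptors per block. */
static inline unsigned int credits_shared(unsigned int groups,
                                          unsigned int desc_per_block,
                                          unsigned int reserved_gdb)
{
    unsigned int credit = 3; /* sb, resize inode, resize inode dindirect */

    /* One extra block in case the descriptors straddle a block boundary. */
    credit += 1 + DIV_ROUND_UP(groups, desc_per_block);
    credit += reserved_gdb;
    return credit;
}
```

With 16 groups, 79 reserved GDT blocks and an assumed 32 descriptors per block, the first formula gives 143 credits and the second 84.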
    • ext4: move check under lock scope to close a race. · 280227a7
      Davide Italiano authored
      
      
      fallocate() checks that the file is extent-based and returns
      EOPNOTSUPP in case it is not. Other tasks can convert from and to
      indirect and extent so it's safe to check only after grabbing
      the inode mutex.
      
      Signed-off-by: Davide Italiano <dccitaliano@gmail.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      280227a7
    • ext4: fix data corruption caused by unwritten and delayed extents · d2dc317d
      Lukas Czerner authored
      
      
      Currently it is possible to lose a whole file system block's worth of data
      when we hit a specific interaction between unwritten and delayed extents
      in the extent status tree.

      The problem is that when we insert a delayed extent into the extent status
      tree, the only way to get rid of it is to write out the delayed buffer.
      However, there is a limitation in the extent status tree implementation:
      when inserting an unwritten extent, should there be even a single
      delayed block, the whole unwritten extent would be marked as delayed.

      At this point, there is no way to get rid of the delayed extents,
      because there are no delayed buffers to write out. So when we write
      into said unwritten extent we will convert it to written, but it still
      remains delayed.

      When we try to write into that block later, ext4_da_map_blocks() will set
      the buffer new and delayed and map it to an invalid block, which causes
      the rest of the block to be zeroed, losing already written data.

      For now we can fix this by simply not allowing the delayed status to be
      set on a written extent in the extent status tree. Also add WARN_ON() to
      make sure that we notice if this happens in the future.

      This problem can be easily reproduced by running the following xfs_io commands:
      
      xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
                -c "falloc 0 131072" \
                -c "pwrite -S 0xbb 65536 2048" \
                -c "fsync" /mnt/test/fff
      
      echo 3 > /proc/sys/vm/drop_caches
      xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff
      
      This can theoretically also be reproduced at random by running fsx,
      but it's not very reliable, though on machines with a bigger page size
      (like ppc) this can be seen more often (especially xfstest generic/127).
      
      Signed-off-by: Lukas Czerner <lczerner@redhat.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@vger.kernel.org
      d2dc317d
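The offsets in the reproducer above are easier to follow when mapped to block numbers. The sketch below assumes a 4096-byte file system block, which the commit message does not state; the helper names are hypothetical.

```c
#include <stdint.h>

/* Assumed file system block size; not stated in the commit message. */
#define BLOCK_SIZE_ASSUMED 4096u

/* Index of the block that contains byte `offset`. */
static inline uint64_t block_index(uint64_t offset)
{
    return offset / BLOCK_SIZE_ASSUMED;
}

/* Position of byte `offset` within its block. */
static inline uint64_t offset_in_block(uint64_t offset)
{
    return offset % BLOCK_SIZE_ASSUMED;
}
```

Under that assumption, the 0xbb write at 65536 and the later 0xdd write at 67584 both land in block 16, in its first and second half respectively, so zeroing the rest of the block during the second write would destroy the 0xbb data.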
  5. May 02, 2015
  6. May 01, 2015
    • Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs · 64887b68
      Linus Torvalds authored
      Pull btrfs fixes from Chris Mason:
       "A few more btrfs fixes.
      
        These range from corners Filipe found in the new free space cache
        writeback to a grab bag of fixes from the list"
      
      * 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
        Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent
        Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.
        btrfs: unlock i_mutex after attempting to delete subvolume during send
        btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache
        btrfs: fix race on ENOMEM in alloc_extent_buffer
        btrfs: handle ENOMEM in btrfs_alloc_tree_block
        Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
        Btrfs: don't check for delalloc_bytes in cache_save_setup
        Btrfs: fix deadlock when starting writeback of bg caches
        Btrfs: fix race between start dirty bg cache writeout and bg deletion
      64887b68
    • Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 036f351e
      Linus Torvalds authored
      Pull arm64 fixes from Will Deacon:
       "Not too much here, but we've addressed a couple of nasty issues in the
        dma-mapping code as well as adding the halfword and byte variants of
        load_acquire/store_release following on from the CSD locking bug that
        you fixed in the core.
      
         - fix perf devicetree warnings at probe time
      
         - fix memory leak in __dma_free()
      
         - ensure DMA buffers are always zeroed
      
         - show IRQ trigger in /proc/interrupts (for parity with ARM)
      
         - implement byte and halfword access for smp_{load_acquire,store_release}"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: perf: Fix the pmu node name in warning message
        arm64: perf: don't warn about missing interrupt-affinity property for PPIs
        arm64: add missing PAGE_ALIGN() to __dma_free()
        arm64: dma-mapping: always clear allocated buffers
        ARM64: Enable CONFIG_GENERIC_IRQ_SHOW_LEVEL
        arm64: add missing data types in smp_load_acquire/smp_store_release
      036f351e