  1. Jun 04, 2015
    • x86/asm/entry, x86/vdso: Move the vDSO code to arch/x86/entry/vdso/ · d603c8e1
      Ingo Molnar authored
      
      
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d603c8e1
    • x86/asm/entry: Move the compat syscall entry code to arch/x86/entry/ · 19a433f4
      Ingo Molnar authored
      
      
      Move the ia32entry.S file over into arch/x86/entry/.
      
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      19a433f4
    • x86/asm/entry: Move entry_64.S and entry_32.S to arch/x86/entry/ · 905a36a2
      Ingo Molnar authored
      
      
      Create a new directory hierarchy for the low level x86 entry code:
      
          arch/x86/entry/*
      
      This will host all the low level glue that is currently scattered
      all across arch/x86/.
      
      Start with entry_64.S and entry_32.S.
      
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      905a36a2
  2. Jun 02, 2015
    • x86/asm/entry/64: Fold identical code paths · 2f63b9db
      Jan Beulich authored
      
      
      retint_kernel doesn't require %rcx to be pointing to thread info
      (anymore?), and the code on the two alternative paths is - not
      really surprisingly - identical.
      
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/556C664F020000780007FB64@mail.emea.novell.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2f63b9db
    • x86/asm/entry/64: Use negative immediates for stack adjustments · 2bf557ea
      Jan Beulich authored
      
      
      Doing so allows adjustments by 128 bytes (as happens for the
      'REMOVE_PT_GPREGS_FROM_STACK 8' uses) to be expressed with a
      single-byte immediate.
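
      For illustration, the encoding difference looks like this (an
      illustrative sketch, not taken from the patch; ADD/SUB r/m64 with
      imm32 vs. sign-extended imm8):

      	48 81 c4 80 00 00 00 	addq	$0x80,%rsp	# +128 needs a 4-byte immediate
      	48 83 ec 80          	subq	$-0x80,%rsp	# -128 fits in a single signed byte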
      
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/556C660F020000780007FB60@mail.emea.novell.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2bf557ea
    • x86/debug: Remove perpetually broken, unmaintainable dwarf annotations · 131484c8
      Ingo Molnar authored
      
      
      So the dwarf2 annotations in low level assembly code have
      become an increasing hindrance: unreadable, messy macros
      mixed into some of the most security sensitive code paths
      of the Linux kernel.
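
      To illustrate what is being removed: a single register save in the
      old entry code typically carried annotations along these lines (a
      sketch using the CFI_* macro names from asm/dwarf2.h, not a verbatim
      quote of the removed code):

      	pushq	%rbx
      	CFI_ADJUST_CFA_OFFSET 8		/* the CFA moved by 8 bytes */
      	CFI_REL_OFFSET rbx, 0		/* %rbx is saved at this stack slot */
      	...
      	popq	%rbx
      	CFI_ADJUST_CFA_OFFSET -8
      	CFI_RESTORE rbx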
      
      These debug info annotations don't even buy the upstream
      kernel anything: dwarf driven stack unwinding has caused
      problems in the past so it's out of tree, and the upstream
      kernel only uses the much more robust framepointers based
      stack unwinding method.
      
      In addition to that there's a steady, slow bitrot going
      on with these annotations, requiring frequent fixups.
      There's no tooling and no functionality upstream that
      keeps it correct.
      
      So burn down the sick forest, allowing new, healthier growth:
      
         27 files changed, 350 insertions(+), 1101 deletions(-)
      
      Someone who has the willingness and time to do this
      properly can attempt to reintroduce dwarf debuginfo in x86
      assembly code plus dwarf unwinding from first principles,
      with the following conditions:
      
       - it should be maximally readable, and maximally low-key to
         'ordinary' code reading and maintenance.
      
       - find a build time method to insert dwarf annotations
         automatically in the most common cases, for pop/push
         instructions that manipulate the stack pointer. This could
         be done for example via a preprocessing step that just
         looks for common patterns - plus special annotations for
         the few cases where we want to depart from the default.
         We have hundreds of CFI annotations, so automating most of
         that makes sense.
      
       - it should come with build tooling checks that ensure that
         CFI annotations are sensible. We've seen such efforts from
         the framepointer side, and there's no reason it couldn't be
         done on the dwarf side.
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Frédéric Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      131484c8
  3. May 24, 2015
    • x86/asm/irq: Stop relying on magic JMP behavior for early_idt_handlers · cdeb6048
      Andy Lutomirski authored
      
      
      The early_idt_handlers asm code generates an array of entry
      points spaced nine bytes apart.  It's not really clear from that
      code or from the places that reference it what's going on, and
      the code only works in the first place because GAS never
      generates two-byte JMP instructions when jumping to global
      labels.
      
      Clean up the code to generate the correct array stride (member size)
      explicitly. This should be considerably more robust against
      screw-ups, as GAS will warn if a .fill directive has a negative
      count.  Using '. =' to advance would have been even more robust
      (it would generate an actual error if it tried to move
      backwards), but it would pad with nulls, confusing anyone who
      tries to disassemble the code.  The new scheme should be much
      clearer to future readers.
      
      While we're at it, improve the comments and rename the array and
      common code.
      
      Binutils may start relaxing jumps to non-weak labels.  If so,
      this change will fix our build, and we may need to backport this
      change.
      
      Before, on x86_64:
      
        0000000000000000 <early_idt_handlers>:
           0:   6a 00                   pushq  $0x0
           2:   6a 00                   pushq  $0x0
           4:   e9 00 00 00 00          jmpq   9 <early_idt_handlers+0x9>
                                5: R_X86_64_PC32        early_idt_handler-0x4
        ...
          48:   66 90                   xchg   %ax,%ax
          4a:   6a 08                   pushq  $0x8
          4c:   e9 00 00 00 00          jmpq   51 <early_idt_handlers+0x51>
                                4d: R_X86_64_PC32       early_idt_handler-0x4
        ...
         117:   6a 00                   pushq  $0x0
         119:   6a 1f                   pushq  $0x1f
         11b:   e9 00 00 00 00          jmpq   120 <early_idt_handler>
                                11c: R_X86_64_PC32      early_idt_handler-0x4
      
      After:
      
        0000000000000000 <early_idt_handler_array>:
           0:   6a 00                   pushq  $0x0
           2:   6a 00                   pushq  $0x0
           4:   e9 14 01 00 00          jmpq   11d <early_idt_handler_common>
        ...
          48:   6a 08                   pushq  $0x8
          4a:   e9 d1 00 00 00          jmpq   120 <early_idt_handler_common>
          4f:   cc                      int3
          50:   cc                      int3
        ...
         117:   6a 00                   pushq  $0x0
         119:   6a 1f                   pushq  $0x1f
         11b:   eb 03                   jmp    120 <early_idt_handler_common>
         11d:   cc                      int3
         11e:   cc                      int3
         11f:   cc                      int3
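
      The fixed stride could be generated along these lines (a minimal
      sketch of the idea, assuming a stride constant such as
      EARLY_IDT_HANDLER_SIZE; not the exact patch):

      	.macro early_idt_entry vector
      	0:
      		pushq	$0			/* fake error code (the real code only does
      						   this for vectors where the CPU does not
      						   push one) */
      		pushq	$\vector		/* vector number for the common handler */
      		jmp	early_idt_handler_common
      		/* pad with int3 up to the fixed stride; GAS warns if the
      		   .fill count would be negative, i.e. if the entry grew
      		   past the stride */
      		.fill	0b + EARLY_IDT_HANDLER_SIZE - ., 1, 0xcc
      	.endm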
      
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Acked-by: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Binutils <binutils@sourceware.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H.J. Lu <hjl.tools@gmail.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/ac027962af343b0c599cbfcf50b945ad2ef3d7a8.1432336324.git.luto@kernel.org
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cdeb6048
  4. May 17, 2015
    • x86/asm/entry/64: Use shorter MOVs from segment registers · adeb5537
      Denys Vlasenko authored
      
      
      The "movw %ds,%cx" instruction needs a 0x66 prefix, while
      "movl %ds,%ecx" does not.
      
      The difference is that the latter form (on 64-bit CPUs)
      overwrites the entire %ecx, not only its lower half.
      
      But subsequent code doesn't depend on the value of upper
      half of %ecx, so we can safely use the shorter instruction.
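
      For reference, the two forms encode as follows (an illustrative
      sketch; both use the 'MOV r/m, Sreg' opcode 0x8C):

      	66 8c d9	movw	%ds,%cx		# operand-size prefix, 3 bytes
      	8c d9   	movl	%ds,%ecx	# 2 bytes, clobbers all of %ecx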
      
      The new code is also faster than the old one, since it no longer
      depends on the old value of %ecx; but this code fragment is not
      performance-critical, so it does not matter much.
      
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Drewry <wad@chromium.org>
      Link: http://lkml.kernel.org/r/1431722346-26585-1-git-send-email-dvlasenk@redhat.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      adeb5537
    • x86/asm/head*.S: Change global labels to local · e839004b
      Borislav Petkov authored
      
      
      Make the disassembly look less confusing:
      
        -- head_64.o.before.asm
        ++ head_64.o.after.asm
         0000000000000120 <early_idt_handler>:
          120:	fc                   	cld
          121:	83 3c 24 02          	cmpl   $0x2,(%rsp)
        - 125:	0f 84 9d 00 00 00    	je     1c8 <is_nmi>
        + 125:	0f 84 9d 00 00 00    	je     1c8 <early_idt_handler+0xa8>
          12b:	83 3d 00 00 00 00 02 	cmpl   $0x2,0x0(%rip)        # 132 <early_idt_handler+0x12>
          132:	74 7e                	je     1b2 <early_idt_handler+0x92>
          134:	ff 05 00 00 00 00    	incl   0x0(%rip)        # 13a <early_idt_handler+0x1a>
        @@ -1198,9 +1198,7 @@ Disassembly of section .init.text:
          1bf:	5a                   	pop    %rdx
          1c0:	59                   	pop    %rcx
          1c1:	58                   	pop    %rax
        - 1c2:	ff 0d 00 00 00 00    	decl   0x0(%rip)        # 1c8 <is_nmi>
        -
        -00000000000001c8 <is_nmi>:
        + 1c2:	ff 0d 00 00 00 00    	decl   0x0(%rip)        # 1c8 <early_idt_handler+0xa8>
          1c8:	48 83 c4 10          	add    $0x10,%rsp
          1cc:	48 cf                	iretq
      
        -- head_32.o.before.asm
        ++ head_32.o.after.asm
         0000016c <early_idt_handler>:
          16c:  fc                      cld
          16d:  83 3c 24 02             cmpl   $0x2,(%esp)
        - 171:  74 73                   je     1e6 <is_nmi>
        + 171:  74 73                   je     1e6 <ex_entry+0xc>
          173:  36 83 3d 00 00 00 00    cmpl   $0x2,%ss:0x0
          17a:  02
          17b:  74 5a                   je     1d7 <hlt_loop>
        @@ -483,8 +483,6 @@ Disassembly of section .init.text:
          1dd:  59                      pop    %ecx
          1de:  58                      pop    %eax
          1df:  36 ff 0d 00 00 00 00    decl   %ss:0x0
        -
        -000001e6 <is_nmi>:
          1e6:  83 c4 08                add    $0x8,%esp
          1e9:  cf                      iret
          1ea:  66 90                   xchg   %ax,%ax
      
      No functionality change.
      
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431793079-11153-1-git-send-email-bp@alien8.de
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e839004b
    • x86: Pack loops tightly as well · 52648e83
      Ingo Molnar authored
      Packing loops tightly (-falign-loops=1) is beneficial to code size:
      
           text        data    bss     dec              filename
       12566391        1617840 1089536 15273767         vmlinux.align.16-byte
       12224951        1617840 1089536 14932327         vmlinux.align.1-byte
       11976567        1617840 1089536 14683943         vmlinux.align.1-byte.funcs-1-byte
       11903735        1617840 1089536 14611111         vmlinux.align.1-byte.funcs-1-byte.loops-1-byte
      
      This reduces the size of the kernel by another 0.6%, so the
      total combined size reduction of the alignment-packing
      patches is ~5.5%.
      
      The x86 decoder bandwidth and caching arguments laid out in:
      
        be6cb027 ("x86: Align jump targets to 1-byte boundaries")
      
      apply to loop alignment as well.
      
      Furthermore, modern CPU uarchs have a loop cache/buffer that
      is an L0 cache before even any uop cache, covering the few
      dozen most recently executed instructions.
      
      This loop cache generally does not have the 16-byte alignment
      restrictions of the uop cache.
      
      Now loop alignment can still be beneficial if:
      
       - a loop is cache-hot and its surroundings are not.
      
       - if the loop is so cache hot that the instruction
         flow becomes x86 decoder bandwidth limited
      
      But loop alignment is harmful if:
      
       - a loop is cache-cold
      
       - a loop's surroundings are cache-hot as well
      
       - two cache-hot loops are close to each other
      
       - if the loop fits into the loop cache
      
       - if the code flow is not decoder bandwidth limited
      
      and I'd argue that the latter five scenarios are much
      more common in the kernel, as our hottest loops are
      typically:
      
       - pointer chasing: this should fit into the loop cache
         in most cases and is typically data cache and address
         generation limited
      
       - generic memory ops (memset, memcpy, etc.): these generally
         fit into the loop cache as well, and are likewise data
         cache limited.
      
      So this patch packs loop addresses tightly as well.
      
      Acked-by: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20150410123017.GB19918@gmail.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      52648e83
  5. May 15, 2015
    • x86: Align jump targets to 1-byte boundaries · be6cb027
      Ingo Molnar authored
      
      
      The following NOP in a hot function caught my attention:
      
        >   5a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
      
      That's a dead NOP that bloats the function a bit, added for the
      default 16-byte alignment that GCC applies for jump targets.
      
      I realize that x86 CPU manufacturers recommend 16-byte jump
      target alignments (it's in the Intel optimization manual),
      to help their relatively narrow decoder prefetch alignment
      and uop cache constraints, but the cost of that is very
      significant:
      
              text           data       bss         dec      filename
          12566391        1617840   1089536    15273767      vmlinux.align.16-byte
          12224951        1617840   1089536    14932327      vmlinux.align.1-byte
      
      By using 1-byte jump target alignment (i.e. no alignment at all)
      we get an almost 3% reduction in kernel size (!) - and a
      probably similar reduction in I$ footprint.
      
      Now, the usual justification for jump target alignment is the
      following:
      
       - modern decoders tend to have 16-byte (effective) decoder
         prefetch windows. (AMD documents it higher, but measurements
         suggest the effective prefetch window on current uarchs is
         still around 16 bytes)
      
       - on Intel there's also the uop-cache with cachelines that have
         16-byte granularity and limited associativity.
      
       - older x86 uarchs had a penalty for decoder fetches that crossed
         16-byte boundaries. These limits are mostly gone from recent
         uarchs.
      
      So if a forward jump target is aligned to a cacheline boundary then
      prefetches will start from a new prefetch-cacheline and there's a
      higher chance of decoding in fewer steps and packing tightly.
      
      But I think that argument is flawed for typical optimized kernel
      code flows: forward jumps often go to 'cold' (uncommon) pieces of
      code, and aligning such cold code to cache lines does not bring
      much advantage, while it causes collateral damage:
      
       - their alignment 'spreads out' the cache footprint and shifts
         follow-up hot code further out
      
       - plus it slows down even 'cold' code that immediately follows 'hot'
         code (like in the above case), which could have benefited from the
         partial cacheline that comes off the end of hot code.
      
      But even in the cache-hot case the 16 byte alignment brings
      disadvantages:
      
       - it spreads out the cache footprint, possibly making the code
         fall out of the L1 I$.
      
       - On Intel CPUs, recent microarchitectures have plenty of
         uop cache (typically doubling every 3 years) - while the
         size of the L1 cache grows much less aggressively. So
         workloads are rarely uop cache limited.
      
      The only situation where alignment might matter is tight loops
      that could fit into a single 16-byte chunk - but those
      are pretty rare in the kernel: if they exist, they tend
      to be pointer chasing or generic memory ops, which both tend
      to be cache miss (or cache allocation) intensive and are not
      decoder bandwidth limited.
      
      So the balance of arguments strongly favors packing kernel
      instructions tightly versus maximizing for decoder bandwidth:
      this patch changes the jump target alignment from 16 bytes
      to 1 byte (tightly packed, unaligned).
      
      Acked-by: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20150410120846.GA17101@gmail.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      be6cb027
  6. May 14, 2015
  7. May 13, 2015
  8. May 11, 2015
    • perf/x86/rapl: Enable Broadwell-U RAPL support · 44b11fee
      Stephane Eranian authored
      
      
      This patch enables support for the RAPL counters (energy consumption
      counters) on Intel Broadwell-U processors (Model 61).
      
      To use:
      
        $ perf stat -a -I 1000 -e power/energy-cores/,power/energy-pkg/,power/energy-ram/ sleep 10
      
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Cc: <stable@vger.kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: jacob.jun.pan@linux.intel.com
      Cc: kan.liang@intel.com
      Cc: peterz@infradead.org
      Cc: sonnyrao@chromium.org
      Link: http://lkml.kernel.org/r/20150423070709.GA4970@thinkpad
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      44b11fee
    • x86/alternatives: Switch AMD F15h and later to the P6 NOPs · f21262b8
      Borislav Petkov authored
      
      
      Software optimization guides for both F15h and F16h cite those
      NOPs as the optimal ones. A microbenchmark confirms that
      actually even older families are better with the single-insn
      NOPs so switch to them for the alternatives.
      
      The cycle counts below include the loop overhead of the measurement,
      but that overhead is the same for all runs.
      
      	F10h, revE:
      	-----------
      	Running NOP tests, 1000 NOPs x 1000000 repetitions
      
      	K8:
      			      90     288.212282 cycles
      			   66 90     288.220840 cycles
      			66 66 90     288.219447 cycles
      		     66 66 66 90     288.223204 cycles
      		  66 66 90 66 90     571.393424 cycles
      	       66 66 90 66 66 90     571.374919 cycles
      	    66 66 66 90 66 66 90     572.249281 cycles
      	 66 66 66 90 66 66 66 90     571.388651 cycles
      
      	P6:
      			      90     288.214193 cycles
      			   66 90     288.225550 cycles
      			0f 1f 00     288.224441 cycles
      		     0f 1f 40 00     288.225030 cycles
      		  0f 1f 44 00 00     288.233558 cycles
      	       66 0f 1f 44 00 00     324.792342 cycles
      	    0f 1f 80 00 00 00 00     325.657462 cycles
      	 0f 1f 84 00 00 00 00 00     430.246643 cycles
      
      	F14h:
      	-----
      	Running NOP tests, 1000 NOPs x 1000000 repetitions
      
      	K8:
      			      90     510.404890 cycles
      			   66 90     510.432117 cycles
      			66 66 90     510.561858 cycles
      		     66 66 66 90     510.541865 cycles
      		  66 66 90 66 90    1014.192782 cycles
      	       66 66 90 66 66 90    1014.226546 cycles
      	    66 66 66 90 66 66 90    1014.334299 cycles
      	 66 66 66 90 66 66 66 90    1014.381205 cycles
      
      	P6:
      			      90     510.436710 cycles
      			   66 90     510.448229 cycles
      			0f 1f 00     510.545100 cycles
      		     0f 1f 40 00     510.502792 cycles
      		  0f 1f 44 00 00     510.589517 cycles
      	       66 0f 1f 44 00 00     510.611462 cycles
      	    0f 1f 80 00 00 00 00     511.166794 cycles
      	 0f 1f 84 00 00 00 00 00     511.651641 cycles
      
      	F15h:
      	-----
      	Running NOP tests, 1000 NOPs x 1000000 repetitions
      
      	K8:
      			      90     243.128396 cycles
      			   66 90     243.129883 cycles
      			66 66 90     243.131631 cycles
      		     66 66 66 90     242.499324 cycles
      		  66 66 90 66 90     481.829083 cycles
      	       66 66 90 66 66 90     481.884413 cycles
      	    66 66 66 90 66 66 90     481.851446 cycles
      	 66 66 66 90 66 66 66 90     481.409220 cycles
      
      	P6:
      			      90     243.127026 cycles
      			   66 90     243.130711 cycles
      			0f 1f 00     243.122747 cycles
      		     0f 1f 40 00     242.497617 cycles
      		  0f 1f 44 00 00     245.354461 cycles
      	       66 0f 1f 44 00 00     361.930417 cycles
      	    0f 1f 80 00 00 00 00     362.844944 cycles
      	 0f 1f 84 00 00 00 00 00     480.514948 cycles
      
      	F16h:
      	-----
      	Running NOP tests, 1000 NOPs x 1000000 repetitions
      
      	K8:
      			      90     507.793298 cycles
      			   66 90     507.789636 cycles
      			66 66 90     507.826490 cycles
      		     66 66 66 90     507.859075 cycles
      		  66 66 90 66 90    1008.663129 cycles
      	       66 66 90 66 66 90    1008.696259 cycles
      	    66 66 66 90 66 66 90    1008.692517 cycles
      	 66 66 66 90 66 66 66 90    1008.755399 cycles
      
      	P6:
      			      90     507.795232 cycles
      			   66 90     507.794761 cycles
      			0f 1f 00     507.834901 cycles
      		     0f 1f 40 00     507.822629 cycles
      		  0f 1f 44 00 00     507.838493 cycles
      	       66 0f 1f 44 00 00     507.908597 cycles
      	    0f 1f 80 00 00 00 00     507.946417 cycles
      	 0f 1f 84 00 00 00 00 00     507.954960 cycles
      
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431332153-18566-2-git-send-email-bp@alien8.de
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f21262b8
    • x86/vdso: Fix 'make bzImage' on older distros · ef7254a5
      Oleg Nesterov authored
      
      
      Change HOST_EXTRACFLAGS to include arch/x86/include/uapi along
      with include/uapi.
      
      This looks more consistent, and this fixes "make bzImage" on my
      old distro which doesn't have asm/bitsperlong.h in /usr/include/.
      
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: <stable@vger.kernel.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 6f121e54 ("x86, vdso: Reimplement vdso.so preparation in build-time C")
      Link: http://lkml.kernel.org/r/1431332153-18566-6-git-send-email-bp@alien8.de
      Link: http://lkml.kernel.org/r/20150507165835.GB18652@redhat.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ef7254a5
  9. May 10, 2015
  10. May 08, 2015
    • x86/entry: Define 'cpu_current_top_of_stack' for 64-bit code · 3a23208e
      Denys Vlasenko authored
      
      
      32-bit code has PER_CPU_VAR(cpu_current_top_of_stack).
      64-bit code uses the somewhat more obscure PER_CPU_VAR(cpu_tss + TSS_sp0).
      
      Define the 'cpu_current_top_of_stack' macro on CONFIG_X86_64
      as well so that the PER_CPU_VAR(cpu_current_top_of_stack)
      expression can be used in both 32-bit and 64-bit code.
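
      For example (an illustrative sketch, register chosen arbitrarily;
      not a hunk from the patch):

      	/* before: 64-bit asm had to spell out the TSS field */
      	movq	PER_CPU_VAR(cpu_tss + TSS_sp0), %rax
      	/* after: the same expression works in 32-bit and 64-bit code */
      	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rax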
      
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Drewry <wad@chromium.org>
      Link: http://lkml.kernel.org/r/1429889495-27850-3-git-send-email-dvlasenk@redhat.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3a23208e
    • x86/entry: Remove unused 'kernel_stack' per-cpu variable · fed7c3f0
      Denys Vlasenko authored
      
      
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Drewry <wad@chromium.org>
      Link: http://lkml.kernel.org/r/1429889495-27850-2-git-send-email-dvlasenk@redhat.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fed7c3f0
    • x86/entry: Stop using PER_CPU_VAR(kernel_stack) · 63332a84
      Denys Vlasenko authored
      
      
      PER_CPU_VAR(kernel_stack) is redundant:
      
        - On the 64-bit build, we can use PER_CPU_VAR(cpu_tss + TSS_sp0).
        - On the 32-bit build, we can use PER_CPU_VAR(cpu_current_top_of_stack).
      
      PER_CPU_VAR(kernel_stack) will be deleted by a separate change.
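
      On the 64-bit side the substitution looks roughly like this (an
      illustrative sketch, not a verbatim hunk from the patch):

      	-	movq	PER_CPU_VAR(kernel_stack), %rsp
      	+	movq	PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp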
      
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Drewry <wad@chromium.org>
      Link: http://lkml.kernel.org/r/1429889495-27850-1-git-send-email-dvlasenk@redhat.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      63332a84
    • x86: Force inlining of atomic ops · 2a4e90b1
      Denys Vlasenko authored
      
      
      With both gcc 4.7.2 and 4.9.2, sometimes gcc mysteriously
      doesn't inline very small functions we expect to be inlined:
      
      $ nm --size-sort vmlinux | grep -iF ' t ' | uniq -c | grep -v '^ *1 ' | sort -rn
          473 000000000000000b t spin_unlock_irqrestore
          449 000000000000005f t rcu_read_unlock
          355 0000000000000009 t atomic_inc                <== THIS
          353 000000000000006e t rcu_read_lock
          350 0000000000000075 t rcu_read_lock_sched_held
          291 000000000000000b t spin_unlock
          266 0000000000000019 t arch_local_irq_restore
          215 000000000000000b t spin_lock
          180 0000000000000011 t kzalloc
          165 0000000000000012 t list_add_tail
          161 0000000000000019 t arch_local_save_flags
          153 0000000000000016 t test_and_set_bit
          134 000000000000000b t spin_unlock_irq
          134 0000000000000009 t atomic_dec                <== THIS
          130 000000000000000b t spin_unlock_bh
          122 0000000000000010 t brelse
          120 0000000000000016 t test_and_clear_bit
          120 000000000000000b t spin_lock_irq
          119 000000000000001e t get_dma_ops
          117 0000000000000053 t cpumask_next
          116 0000000000000036 t kref_get
          114 000000000000001a t schedule_work
          106 000000000000000b t spin_lock_bh
          103 0000000000000019 t arch_local_irq_disable
      ...
      
      Note the sizes of the marked functions: they are merely 9 bytes long!
      Selecting the functions with 'atomic' in their names:
      
          355 0000000000000009 t atomic_inc
          134 0000000000000009 t atomic_dec
           98 0000000000000014 t atomic_dec_and_test
           31 000000000000000e t atomic_add_return
           27 000000000000000a t atomic64_inc
           26 000000000000002f t kmap_atomic
           24 0000000000000009 t atomic_add
           12 0000000000000009 t atomic_sub
           10 0000000000000021 t __atomic_add_unless
           10 000000000000000a t atomic64_add
            5 000000000000001f t __atomic_add_unless.constprop.7
            5 000000000000000a t atomic64_dec
            4 000000000000001f t __atomic_add_unless.constprop.18
            4 000000000000001f t __atomic_add_unless.constprop.12
            4 000000000000001f t __atomic_add_unless.constprop.10
            3 000000000000001f t __atomic_add_unless.constprop.13
            3 0000000000000011 t atomic64_add_return
            2 000000000000001f t __atomic_add_unless.constprop.9
            2 000000000000001f t __atomic_add_unless.constprop.8
            2 000000000000001f t __atomic_add_unless.constprop.6
            2 000000000000001f t __atomic_add_unless.constprop.5
            2 000000000000001f t __atomic_add_unless.constprop.3
            2 000000000000001f t __atomic_add_unless.constprop.22
            2 000000000000001f t __atomic_add_unless.constprop.14
            2 000000000000001f t __atomic_add_unless.constprop.11
            2 000000000000001e t atomic_dec_if_positive
            2 0000000000000014 t atomic_inc_and_test
            2 0000000000000011 t atomic_add_return.constprop.4
            2 0000000000000011 t atomic_add_return.constprop.17
            2 0000000000000011 t atomic_add_return.constprop.16
            2 000000000000000d t atomic_inc.constprop.4
            2 000000000000000c t atomic_cmpxchg
      
      This patch fixes this for the x86 atomic ops via
      s/inline/__always_inline/. This decreases the allyesconfig kernel by
      about 25k:
      
          text     data      bss       dec     hex filename
      82399481 22255416 20627456 125282353 777a831 vmlinux.before
      82375570 22255544 20627456 125258570 7774b4a vmlinux
      
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Drewry <wad@chromium.org>
      Link: http://lkml.kernel.org/r/1431080762-17797-1-git-send-email-dvlasenk@redhat.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2a4e90b1
    • perf/x86/intel: Fix SLM cache event list · 6d374056
      Kan Liang authored
      
      
      iTLB-load-misses and LLC-load-misses count incorrectly on SLM.
      
      There is no ITLB.MISSES support on SLM. Event PAGE_WALKS.I_SIDE_WALK
      should be used to count iTLB-load-misses. This event counts when an
      instruction (I) page walk is completed or started. Since a page walk
      implies a TLB miss, the number of TLB misses can be counted by counting
      the number of pagewalks.
      
      DMND_DATA_RD counts both demand and DCU prefetch data reads. However,
      LLC-load-misses should only count demand reads. There is no way to not
      include prefetches with a single counter on SLM. So the LLC-load-misses
      support should be removed on SLM.
      
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1429608881-5055-1-git-send-email-kan.liang@intel.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6d374056
    • x86/asm/entry/64: Clean up usage of TEST insns · 03335e95
      Denys Vlasenko authored
      
      
      By the nature of the TEST operation, it is often possible
      to test a narrower part of the operand:
      
          "testl $3, mem"  -> "testb $3, mem"
      
      This results in shorter insns, because the TEST insn has no
      sign-extending byte-immediate form, unlike other ALU ops.
      
         text	   data	    bss	    dec	    hex	filename
        11674	      0	      0	  11674	   2d9a	entry_64.o.before
        11658	      0	      0	  11658	   2d8a	entry_64.o
      
      Changes in object code:
      
      -	f7 84 24 88 00 00 00 03 00 00 00 	testl  $0x3,0x88(%rsp)
      +	f6 84 24 88 00 00 00 03	         	testb  $0x3,0x88(%rsp)
      -	f7 44 24 68 03 00 00 00          	testl  $0x3,0x68(%rsp)
      +	f6 44 24 68 03                  	testb  $0x3,0x68(%rsp)
      -	f7 84 24 90 00 00 00 03 00 00 00	testl  $0x3,0x90(%rsp)
      +	f6 84 24 90 00 00 00 03         	testb  $0x3,0x90(%rsp)
      
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Drewry <wad@chromium.org>
      Link: http://lkml.kernel.org/r/1430140912-7960-2-git-send-email-dvlasenk@redhat.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      03335e95
    • x86/asm/entry/64: Tidy up JZ insns after TESTs · dde74f2e
      Denys Vlasenko authored
      
      
      After TESTs, use the logically correct JZ/JNZ mnemonics instead of
      JE/JNE. This doesn't change the generated code, since each pair
      assembles to the same opcode.
      
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Drewry <wad@chromium.org>
      Link: http://lkml.kernel.org/r/1430140912-7960-1-git-send-email-dvlasenk@redhat.com
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      dde74f2e
  11. May 06, 2015
  12. May 05, 2015
  13. May 01, 2015
    • x86/PCI/ACPI: Make all resources except [io 0xcf8-0xcff] available on PCI bus · 2c62e849
      Jiang Liu authored
      An IO port or MMIO resource assigned to a PCI host bridge may be
      consumed by the host bridge itself or available to its child
      bus/devices. The ACPI specification defines a bit (Producer/Consumer)
      to tell whether the resource is consumed by the host bridge itself,
      but firmware hasn't used that bit consistently, so we can't rely on it.
      
      Before commit 593669c2 ("x86/PCI/ACPI: Use common ACPI resource
      interfaces to simplify implementation"), arch/x86/pci/acpi.c ignored
      all IO port resources defined by acpi_resource_io and
      acpi_resource_fixed_io to filter out IO ports consumed by the host
      bridge itself.
      
      Commit 593669c2 ("x86/PCI/ACPI: Use common ACPI resource interfaces
      to simplify implementation") started accepting all IO port and MMIO
      resources, which caused a regression that IO port resources consumed
      by the host bridge itself became available to its child devices.
      
      Then commit 63f1789e ("x86/PCI/ACPI: Ignore resources consumed by
      host bridge itself") ignored resources consumed by the host bridge
      itself by checking the IORESOURCE_WINDOW flag, which accidentally removed
      MMIO resources defined by acpi_resource_memory24, acpi_resource_memory32
      and acpi_resource_fixed_memory32.
      
      On x86 and IA64 platforms, all IO port and MMIO resources are assumed
      to be available to child bus/devices except one special case:
          IO port [0xCF8-0xCFF] is consumed by the host bridge itself
          to access PCI configuration space.
      
      So explicitly filter out the PCI CFG IO ports [0xCF8-0xCFF]. This solution
      will also ease the way to consolidate ACPI PCI host bridge common code
      from x86, ia64 and ARM64.
      
      Related ACPI table are archived at:
      https://bugzilla.kernel.org/show_bug.cgi?id=94221
      
      Related discussions at:
      http://patchwork.ozlabs.org/patch/461633/
      https://lkml.org/lkml/2015/3/29/304
      
      Fixes: 63f1789e ("x86/PCI/ACPI: Ignore resources consumed by host bridge itself")
      Reported-by: Bernhard Thaler <bernhard.thaler@wvnet.at>
      Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
      Cc: 4.0+ <stable@vger.kernel.org> # 4.0+
      Reviewed-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      2c62e849
  14. Apr 30, 2015
    • xen: Suspend ticks on all CPUs during suspend · 2b953a5e
      Boris Ostrovsky authored
      Commit 77e32c89 ("clockevents: Manage device's state separately for
      the core") decouples clockevent device's modes from states. With this
      change when a Xen guest tries to resume, it won't be calling its
      set_mode op which needs to be done on each VCPU in order to make the
      hypervisor aware that we are in oneshot mode.
      
      This happens because clockevents_tick_resume() (which is an intermediate
      step of resuming ticks on a processor) doesn't call clockevents_set_state()
      anymore and because during suspend clockevent devices on all VCPUs (except
      for the one doing the suspend) are left in ONESHOT state. As a result,
      during resume the clockevents state machine will assume that the device
      is already where it should be and doesn't need to be updated.
      
      To avoid this problem we should suspend ticks on all VCPUs during
      suspend.
      
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      2b953a5e
  15. Apr 27, 2015
    • x86: pvclock: Really remove the sched notifier for cross-cpu migrations · 73459e2a
      Paolo Bonzini authored
      This reverts commits 0a4e6be9 and 80f7fdb1.
      
      The task migration notifier was originally introduced in order to support
      the pvclock vsyscall with non-synchronized TSC, but KVM only supports it
      with synchronized TSC.  Hence, on KVM the race condition is only needed
      due to a bad implementation on the host side, and even then it's so rare
      that it's mostly theoretical.
      
      As far as KVM is concerned it's possible to fix the host, avoiding the
      additional complexity in the vDSO and the (re)introduction of the task
      migration notifier.
      
      Xen, on the other hand, hasn't yet implemented vsyscall support at
      all, so we do not care about its plans for non-synchronized TSC.
      
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Suggested-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      73459e2a
    • kvm: x86: fix kvmclock update protocol · 5dca0d91
      Radim Krčmář authored
      
      
      The kvmclock spec says that the host will increment a version field to
      an odd number, then update stuff, then increment it to an even number.
      The host is buggy and doesn't do this, and the result is observable
      when one vcpu reads another vcpu's kvmclock data.
      
      There's no good way for a guest kernel to keep its vdso from reading
      a different vcpu's kvmclock data, but we don't need to care about
      changing VCPUs as long as we read consistent data from kvmclock.
      (VCPU can change outside of this loop too, so it doesn't matter if we
      return a value not fit for this VCPU.)
      
      Based on a patch by Radim Krčmář.
      
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5dca0d91