  1. Dec 20, 2013
    • Kees Cook's avatar
      stackprotector: Unify the HAVE_CC_STACKPROTECTOR logic between architectures · 19952a92
      Kees Cook authored
      
      
      Instead of duplicating the CC_STACKPROTECTOR Kconfig and
      Makefile logic in each architecture, switch to using
      HAVE_CC_STACKPROTECTOR and keep everything in one place. This
      retains the x86-specific bug verification scripts.
      
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Shawn Guo <shawn.guo@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-mips@linux-mips.org
      Cc: linux-arch@vger.kernel.org
      Link: http://lkml.kernel.org/r/1387481759-14535-2-git-send-email-keescook@chromium.org
      
      
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      19952a92
  2. Dec 13, 2013
    • Gleb Natapov's avatar
      KVM: x86: fix guest-initiated crash with x2apic (CVE-2013-6376) · 17d68b76
      Gleb Natapov authored
      
      
      A guest can cause a BUG_ON() leading to a host kernel crash.
      When the guest writes to the ICR to request an IPI while in x2apic
      mode, the destination is read from ICR2, which is a register that
      the guest can control.
      
      kvm_irq_delivery_to_apic_fast uses the high 16 bits of ICR2 as the
      cluster id.  A BUG_ON is triggered, which protects against an
      out-of-bounds access to map->logical_map and so prevents anything
      really unsafe from occurring.
      
      The logic in the code is correct from a real HW point of view. The problem
      is that KVM supports only one cluster with ID 0 in clustered mode, but
      the code that has the bug does not take this into account.
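      
      The idea of rejecting an out-of-range cluster id instead of asserting
      on it can be sketched as follows.  This is a minimal illustration, not
      the KVM code: the struct, the single-cluster table and the demo_* names
      are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical stand-in for the APIC map: only cluster 0 exists. */
struct demo_apic_map {
    int logical_map[1][16];
};

/*
 * Return the table for the cluster selected by the high 16 bits of a
 * guest-controlled ICR2 value, or NULL if that cluster does not exist.
 * The caller treats NULL as "no fast path" rather than as a fatal bug.
 */
static const int *demo_cluster(const struct demo_apic_map *map, uint32_t icr2)
{
    uint32_t cid = icr2 >> 16;

    assert(map != NULL);
    if (cid >= DEMO_ARRAY_SIZE(map->logical_map))
        return NULL;    /* guest asked for a cluster we do not model */
    return map->logical_map[cid];
}
```

      A guest-supplied value is never a reason to assert; returning a
      "not found" result keeps the host alive and lets the caller fall back.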
      
      Reported-by: Lars Bull <larsbull@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      17d68b76
    • Andy Honig's avatar
      KVM: x86: Convert vapic synchronization to _cached functions (CVE-2013-6368) · fda4e2e8
      Andy Honig authored
      
      
      In kvm_lapic_sync_from_vapic and kvm_lapic_sync_to_vapic there is the
      potential to corrupt kernel memory if userspace provides an address that
      is at the end of a page.  This patch converts those functions to use
      kvm_write_guest_cached and kvm_read_guest_cached.  It also checks the
      vapic_address specified by userspace during ioctl processing and returns
      an error to userspace if the address is not a valid GPA.
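      
      The hazard described above (an access that starts near the end of a
      page and spills into the next one) can be illustrated with a small
      check.  This sketch only shows the boundary arithmetic; it is not how
      the patch works, which relies on the _cached helpers.  The page size
      and the demo_* name are assumptions for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u    /* assumed page size for the example */

/* True if the byte range [addr, addr + len) stays within one page. */
static bool demo_fits_in_page(uint64_t addr, size_t len)
{
    uint64_t offset = addr & (DEMO_PAGE_SIZE - 1);

    assert(len > 0);
    return len <= DEMO_PAGE_SIZE - offset;
}
```

      For a 4-byte access, an address 4 bytes before the page end still
      fits, while an address 2 bytes before the page end does not.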
      
      This is generally not guest triggerable, because the required write is
      done by firmware that runs before the guest.  Also, it only affects AMD
      processors and oldish Intel that do not have the FlexPriority feature
      (unless you disable FlexPriority, of course; then newer processors are
      also affected).
      
      Fixes: b93463aa ('KVM: Accelerated apic support')
      
      Reported-by: Andrew Honig <ahonig@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Andrew Honig <ahonig@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fda4e2e8
    • Andy Honig's avatar
      KVM: x86: Fix potential divide by 0 in lapic (CVE-2013-6367) · b963a22e
      Andy Honig authored
      
      
      Under guest controllable circumstances apic_get_tmcct will execute a
      divide by zero and cause a crash.  If the guest cpuid supports
      tsc deadline timers and the guest performs the following sequence of
      requests, the host will crash.
      - Set the mode to periodic
      - Set the TMICT to 0
      - Set the mode bits to 11 (neither periodic, nor one shot, nor tsc deadline)
      - Set the TMICT to non-zero.
      Then the lapic_timer.period will be 0, but the TMICT will not be.  If the
      guest then reads from the TMCCT then the host will perform a divide by 0.
      
      This patch ensures that if the lapic_timer.period is 0, then the division
      does not occur.
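      
      A guard of this kind can be sketched as below.  This is a simplified
      model under stated assumptions, not the lapic code: the function name,
      its parameters and the tick_ns divisor are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of reading a timer's current count.  remaining_ns is the time
 * left until expiry, period_ns the programmed period, and tick_ns the
 * (nonzero) duration of one count.  A zero period short-circuits to 0
 * so that the modulo below can never divide by zero.
 */
static uint32_t demo_current_count(int64_t remaining_ns, uint64_t period_ns,
                                   uint64_t tick_ns)
{
    assert(tick_ns > 0);
    if (period_ns == 0)
        return 0;               /* the guard: skip the division entirely */
    if (remaining_ns < 0)
        remaining_ns = 0;       /* timer already expired */
    return (uint32_t)(((uint64_t)remaining_ns % period_ns) / tick_ns);
}
```

      With a period of 1000 ns and 10 ns per count, 2500 ns remaining reads
      back as 50; with a period of 0 the read returns 0 instead of faulting.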
      
      Reported-by: Andrew Honig <ahonig@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Andrew Honig <ahonig@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b963a22e
  3. Dec 11, 2013
  4. Dec 10, 2013
  5. Dec 05, 2013
  6. Dec 04, 2013
  7. Nov 29, 2013
  8. Nov 22, 2013
  9. Nov 21, 2013
  10. Nov 20, 2013
  11. Nov 19, 2013
  12. Nov 17, 2013
    • Ramkumar Ramachandra's avatar
      um/vdso: add .gitignore for a couple of targets · b13a9bfc
      Ramkumar Ramachandra authored
      
      
      Cc: Richard Weinberger <richard@nod.at>
      Signed-off-by: Ramkumar Ramachandra <artagnon@gmail.com>
      Signed-off-by: Richard Weinberger <richard@nod.at>
      b13a9bfc
    • Ramkumar Ramachandra's avatar
      arch/um: make it work with defconfig and x86_64 · e40f04d0
      Ramkumar Ramachandra authored
      
      
      arch/um/defconfig only lists one default configuration, and that applies
      only to the i386 architecture.  Replace it with two minimal
      configuration files generated using `make savedefconfig`:
      
        i386_defconfig and x86_64_defconfig
      
      The build scripts now require two updates:
      
      1. um's Kconfig (arch/x86/um/Kconfig) should specify an ARCH_DEFCONFIG
         section explicitly pointing to these scripts if the required
         variables are set.  Take care to remove the DEFCONFIG_LIST section
         defined in the included file arch/um/Kconfig.common.
      
      2. um's Makefile (arch/um/Makefile) should set KBUILD_DEFCONFIG properly
         for the top-level Makefile to pick up.  Copy the logic in
         arch/x86/Makefile to properly pick the defconfig file depending on
         the actual architecture; except we're working with $SUBARCH here,
         instead of $ARCH.
      
      Now, you can do:
      
        $ ARCH=um make defconfig
        $ ARCH=um make
      
      and successfully build User-Mode Linux on an x86_64 box in default
      configuration.
      
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Jeff Dike <jdike@addtoit.com>
      Signed-off-by: Ramkumar Ramachandra <artagnon@gmail.com>
      Signed-off-by: Richard Weinberger <richard@nod.at>
      e40f04d0
    • Richard Weinberger's avatar
      um: Rewrite show_stack() · 9d1ee8ce
      Richard Weinberger authored
      
      
      Currently on UML stack traces are not very reliable and both
      x86 and x86_64 have their own implementations.
      This patch unifies both and adds support for marking unreliable
      function calls.
      
      Signed-off-by: Richard Weinberger <richard@nod.at>
      9d1ee8ce
  13. Nov 15, 2013
    • David Rientjes's avatar
      x86: Export 'boot_cpu_physical_apicid' to modules · cc08e04c
      David Rientjes authored
      
      
      Commit 9ebddac7 "ACPI, x86: Fix extended error log driver to depend on
      CONFIG_X86_LOCAL_APIC" fixed a build error when CONFIG_X86_LOCAL_APIC was not
      selected and !CONFIG_SMP.
      
      However, since CONFIG_ACPI_EXTLOG is tristate, there is a second build error:
      
        ERROR: "boot_cpu_physical_apicid" [drivers/acpi/acpi_extlog.ko] undefined!
      
      The symbol needs to be exported for it to be available.
      
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Tony Luck <tony.luck@intel.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1311141504080.30112@chino.kir.corp.google.com
      
      
      [ Changed it to a _GPL() export. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cc08e04c
    • Christoph Hellwig's avatar
      kernel: remove CONFIG_USE_GENERIC_SMP_HELPERS · 0a06ff06
      Christoph Hellwig authored
      
      
      We've switched over every architecture that supports SMP to it, so
      remove the now useless config variable.
      
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a06ff06
    • Kirill A. Shutemov's avatar
      mm: dynamically allocate page->ptl if it cannot be embedded to struct page · 49076ec2
      Kirill A. Shutemov authored
      
      
      If split page table lock is in use, we embed the lock into the struct
      page of the table's page.  We have to disable split lock if spinlock_t
      is too big to be embedded, like when DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC
      is enabled.
      
      This patch adds support for dynamic allocation of split page table lock
      if we can't embed it to struct page.
      
      page->ptl is unsigned long now and we use it as spinlock_t if
      sizeof(spinlock_t) <= sizeof(long), otherwise it's a pointer to spinlock_t.
      
      The spinlock_t is allocated in pgtable_page_ctor() for PTE table and in
      pgtable_pmd_page_ctor() for PMD table.  All other helpers are converted
      to support dynamically allocated page->ptl.
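      
      The embed-or-allocate decision can be sketched in plain C as below.
      This is an illustration under assumptions, not the kernel code: the
      demo_* types and helpers are hypothetical, the lock is deliberately
      made larger than a long to exercise the allocation path, and unsigned
      long is assumed to be wide enough to hold a pointer (as on Linux).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical lock that, like a debug spinlock_t, is larger than a long. */
typedef struct {
    int locked;
    void *owner;
    const char *name;
} demo_lock_t;

/* Stand-in for the one field of struct page that matters here. */
struct demo_page {
    unsigned long ptl;
};

/* Nonzero when the lock is small enough to live inside ptl itself. */
#define DEMO_LOCK_EMBEDDED (sizeof(demo_lock_t) <= sizeof(unsigned long))

/* Constructor: may fail, because the large-lock case allocates memory. */
static bool demo_ptlock_init(struct demo_page *page)
{
    assert(page != NULL);
    page->ptl = 0;
    if (!DEMO_LOCK_EMBEDDED) {
        demo_lock_t *lock = calloc(1, sizeof(*lock));

        if (!lock)
            return false;
        page->ptl = (unsigned long)(uintptr_t)lock;
    }
    return true;
}

/* Accessor: hides whether ptl is the lock or a pointer to it. */
static demo_lock_t *demo_ptlock_ptr(struct demo_page *page)
{
    if (DEMO_LOCK_EMBEDDED)
        return (demo_lock_t *)&page->ptl;
    return (demo_lock_t *)(uintptr_t)page->ptl;
}

/* Destructor: releases the lock only if it was allocated separately. */
static void demo_ptlock_free(struct demo_page *page)
{
    if (!DEMO_LOCK_EMBEDDED)
        free((void *)(uintptr_t)page->ptl);
    page->ptl = 0;
}
```

      Callers only ever go through the accessor, so the size of the lock
      type can change without touching them.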
      
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49076ec2
    • Kirill A. Shutemov's avatar
      x86: handle pgtable_page_ctor() fail · cecbd1b5
      Kirill A. Shutemov authored
      
      
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cecbd1b5
    • Kirill A. Shutemov's avatar
      x86: add missed pgtable_pmd_page_ctor/dtor calls for preallocated pmds · 09ef4939
      Kirill A. Shutemov authored
      
      
      In split page table lock case, we embed spinlock_t into struct page.
      For obvious reason, we don't want to increase size of struct page if
      spinlock_t is too big, like with DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC or
      on -rt kernel.  So we disable split page table lock, if spinlock_t is
      too big.
      
      This patchset allows the lock to be allocated dynamically if spinlock_t
      is big.  In this case page->ptl is used to store a pointer to the
      spinlock instead of the spinlock itself.  It costs an additional cache
      line for indirect access, but fixes page fault scalability for
      multi-threaded applications.
      
      LOCK_STAT depends on DEBUG_SPINLOCK, so on current kernel enabling
      LOCK_STAT to analyse scalability issues breaks scalability.  ;)
      
      The patchset mostly fixes this.  Results for ./thp_memscale -c 80 -b 512M
      on 4-socket machine:
      
      baseline, no CONFIG_LOCK_STAT:	9.115460703 seconds time elapsed
      baseline, CONFIG_LOCK_STAT=y:	53.890567123 seconds time elapsed
      patched, no CONFIG_LOCK_STAT:	8.852250368 seconds time elapsed
      patched, CONFIG_LOCK_STAT=y:	11.069770759 seconds time elapsed
      
      Patch count is scary, but most of them are trivial. Overview:
      
       Patches 1-4	A few bug fixes. No dependencies on other patches.
      		Probably should be applied as soon as possible.
      
       Patch 5	Changes signature of pgtable_page_ctor(). We will use it
      		for dynamic lock allocation, so it can fail.
      
       Patches 6-8	Add missing constructor/destructor calls on a few archs.
      		This fixes NR_PAGETABLE accounting and prepares to use
      		split ptl.
      
       Patches 9-33	Add pgtable_page_ctor() fail handling to all archs.
      
       Patch 34	Finally adds support of dynamically-allocated page->ptl.
      		Also contains documentation for split page table lock.
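      
      Once a constructor can fail, every allocation site needs an unwind
      path.  The pattern can be sketched as below; this is a userspace
      illustration with hypothetical demo_* names, not one of the per-arch
      patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* A constructor that may fail, modelled as a function pointer. */
typedef bool (*demo_ctor_fn)(void *table);

static bool demo_ctor_ok(void *table)   { (void)table; return true; }
static bool demo_ctor_fail(void *table) { (void)table; return false; }

/*
 * Allocate a zeroed table and run its constructor.  If the constructor
 * fails, free the table and report the failure to the caller instead of
 * returning a half-initialised object.
 */
static void *demo_table_alloc(demo_ctor_fn ctor)
{
    void *table;

    assert(ctor != NULL);
    table = calloc(1, 4096);
    if (!table)
        return NULL;
    if (!ctor(table)) {
        free(table);
        return NULL;
    }
    return table;
}
```

      The caller sees a single failure mode (NULL) whether the allocation
      or the constructor was the step that failed.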
      
      This patch (of 34):
      
      I've missed that we preallocate a few pmds on pgd_alloc() if X86_PAE
      is enabled.  Let's add the missed constructor/destructor calls.
      
      I haven't noticed it during testing since prep_new_page() clears
      page->mapping and therefore page->ptl.  It's effectively equal to
      spin_lock_init(&page->ptl).
      
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.chen@sunplusct.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rob Herring <rob.herring@calxeda.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09ef4939
    • Kirill A. Shutemov's avatar
      x86, mm: enable split page table lock for PMD level · 9491846f
      Kirill A. Shutemov authored
      
      
      Enable PMD split page table lock for X86_64 and PAE.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9491846f
    • Kirill A. Shutemov's avatar
      mm: rename USE_SPLIT_PTLOCKS to USE_SPLIT_PTE_PTLOCKS · 57c1ffce
      Kirill A. Shutemov authored
      
      
      We're going to introduce split page table lock for PMD level.  Let's
      rename existing split ptlock for PTE level to avoid confusion.
      
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      57c1ffce
    • Rafael J. Wysocki's avatar
      ACPI / driver core: Store an ACPI device pointer in struct acpi_dev_node · 7b199811
      Rafael J. Wysocki authored
      
      
      Modify struct acpi_dev_node to contain a pointer to struct acpi_device
      associated with the given device object (that is, its ACPI companion
      device) instead of an ACPI handle corresponding to it.  Introduce two
      new macros for manipulating that pointer in a CONFIG_ACPI-safe way,
      ACPI_COMPANION() and ACPI_COMPANION_SET(), and rework the
      ACPI_HANDLE() macro to take the above changes into account.
      Drop the ACPI_HANDLE_SET() macro entirely and rework its users to
      use ACPI_COMPANION_SET() instead.  For some of them who used to
      pass the result of acpi_get_child() directly to ACPI_HANDLE_SET()
      introduce a helper routine acpi_preset_companion() doing an
      equivalent thing.
      
      The main motivation for doing this is that there are things
      represented by struct acpi_device objects that don't have valid
      ACPI handles (so called fixed ACPI hardware features, such as
      power and sleep buttons) and we would like to create platform
      device objects for them and "glue" them to their ACPI companions
      in the usual way (which currently is impossible due to the
      lack of valid ACPI handles).  However, there are more reasons
      why it may be useful.
      
      First, struct acpi_device pointers allow much better type checking
      than void pointers which are ACPI handles, so it should be more
      difficult to write buggy code using modified struct acpi_dev_node
      and the new macros.  Second, the change should help to reduce (over
      time) the number of places in which the result of ACPI_HANDLE() is
      passed to acpi_bus_get_device() in order to obtain a pointer to the
      struct acpi_device associated with the given "physical" device,
      because now that pointer is returned by ACPI_COMPANION() directly.
      Finally, the change should make it easier to write generic code that
      will build both for CONFIG_ACPI set and unset without adding explicit
      compiler directives to it.
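      
      The "builds with the option set or unset" property comes from hiding
      the field access behind macros.  A minimal sketch of that idea follows;
      the DEMO_* macros, the structs and the DEMO_CONFIG_ACPI switch are
      assumptions for illustration and are not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_CONFIG_ACPI 1  /* remove this line to model a build without ACPI */

struct demo_acpi_device {
    int id;
};

struct demo_device {
#ifdef DEMO_CONFIG_ACPI
    struct demo_acpi_device *companion;   /* typed pointer, not a void handle */
#endif
    const char *name;
};

#ifdef DEMO_CONFIG_ACPI
#define DEMO_COMPANION(dev)            ((dev)->companion)
#define DEMO_COMPANION_SET(dev, adev)  ((dev)->companion = (adev))
#else
#define DEMO_COMPANION(dev)            ((struct demo_acpi_device *)NULL)
#define DEMO_COMPANION_SET(dev, adev)  do { } while (0)
#endif

/* Generic code: no #ifdef needed here, the macros absorb the difference. */
static int demo_companion_id(struct demo_device *dev)
{
    struct demo_acpi_device *adev;

    assert(dev != NULL);
    adev = DEMO_COMPANION(dev);
    return adev ? adev->id : -1;
}
```

      Because the macro yields a typed pointer, passing the wrong kind of
      object is a compile-time diagnostic rather than a silent void pointer
      conversion.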
      
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Tested-by: Mika Westerberg <mika.westerberg@linux.intel.com> # on Haswell
      Reviewed-by: Mika Westerberg <mika.westerberg@linux.intel.com>
      Reviewed-by: Aaron Lu <aaron.lu@intel.com> # for ATA and SDIO part
      7b199811
  14. Nov 14, 2013
  15. Nov 13, 2013
    • Vineet Gupta's avatar
      x86: move fpu_counter into ARCH specific thread_struct · c375f15a
      Vineet Gupta authored
      
      
      Only a couple of arches (sh/x86) use fpu_counter in task_struct so it can
      be moved out into ARCH specific thread_struct, reducing the size of
      task_struct for other arches.
      
      Compile tested i386_defconfig + gcc 4.7.3
      
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Paul Mundt <paul.mundt@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c375f15a
    • Zhi Yong Wu's avatar
    • Tang Chen's avatar
      mem-hotplug: introduce movable_node boot option · c5320926
      Tang Chen authored
      
      
      The hot-Pluggable field in SRAT specifies which memory is hotpluggable.
      As we mentioned before, if hotpluggable memory is used by the kernel, it
      cannot be hot-removed.  So memory hotplug users may want to set all
      hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
      
      Memory hotplug users may also set a node as movable node, which has
      ZONE_MOVABLE only, so that the whole node can be hot-removed.
      
      But the kernel cannot use memory in ZONE_MOVABLE.  By doing this, the
      kernel cannot use memory in movable nodes.  This will degrade NUMA
      performance, and other users may be unhappy.
      
      So we need a way to allow users to enable and disable this functionality.
      In this patch, we introduce the movable_node boot option to allow users to
      choose not to consume hotpluggable memory at early boot time, so that
      later we can set it as ZONE_MOVABLE.
      
      To achieve this, the movable_node boot option will control the memblock
      allocation direction.  That said, after memblock is ready, before SRAT is
      parsed, we should allocate memory near the kernel image as we explained in
      the previous patches.  So if movable_node boot option is set, the kernel
      does the following:
      
      1. After memblock is ready, make memblock allocate memory bottom up.
      2. After SRAT is parsed, make memblock behave as default, allocate memory
         top down.
      
      Users can specify "movable_node" in kernel commandline to enable this
      functionality.  For those who don't use memory hotplug or who don't want
      to lose their NUMA performance, just don't specify anything.  The kernel
      will work as before.
      
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Suggested-by: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Toshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c5320926
    • Tang Chen's avatar
      x86, acpi, crash, kdump: do reserve_crashkernel() after SRAT is parsed. · fa591c4a
      Tang Chen authored
      
      
      Memory reserved for crashkernel could be large.  So we should not allocate
      this memory bottom up from the end of kernel image.
      
      When SRAT is parsed, we will be able to know which memory is hotpluggable,
      and we can avoid allocating this memory for the kernel.  So reorder
      reserve_crashkernel() after SRAT is parsed.
      
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Toshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa591c4a
    • Tang Chen's avatar
      x86/mem-hotplug: support initialize page tables in bottom-up · b959ed6c
      Tang Chen authored
      
      
      The Linux kernel cannot migrate pages used by the kernel.  As a result,
      kernel pages cannot be hot-removed.  So we cannot allocate hotpluggable
      memory for the kernel.
      
      In a memory hotplug system, any numa node the kernel resides in should be
      unhotpluggable.  And for a modern server, each node could have at least
      16GB memory.  So memory around the kernel image is highly likely
      unhotpluggable.
      
      ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
      info.  But before SRAT is parsed, memblock has already started to allocate
      memory for the kernel.  So we need to prevent memblock from doing this.
      
      Setting up the direct memory mapping page tables is one such case.
      init_mem_mapping() is called before SRAT is parsed.  To prevent page
      tables from being allocated within hotpluggable memory, we will
      allocate page tables bottom-up, from the end of the kernel image
      toward higher memory.
      
      Note:
      As for allocating page tables in lower memory, TJ said:
      
      : This is an optional behavior which is triggered by a very specific kernel
      : boot param, which I suspect is gonna need to stick around to support
      : memory hotplug in the current setup unless we add another layer of address
      : translation to support memory hotplug.
      
      As for the concern that page tables may occupy too much lower memory
      when 4K mappings are used (CONFIG_DEBUG_PAGEALLOC and CONFIG_KMEMCHECK
      both disable using >4k pages), TJ said:
      
      : But as I said in the same paragraph, parsing SRAT earlier doesn't solve
      : the problem in itself either.  Ignoring the option if 4k mapping is
      : required and memory consumption would be prohibitive should work, no?
      : Something like that would be necessary if we're gonna worry about cases
      : like this no matter how we implement it, but, frankly, I'm not sure this
      : is something worth worrying about.
      
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Toshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b959ed6c
    • Tang Chen's avatar
      x86/mm: factor out of top-down direct mapping setup · 0167d7d8
      Tang Chen authored
      
      
      Create a new function memory_map_top_down to factor out the top-down
      direct memory mapping pagetable setup.  This is also a preparation for the
      following patch, which will introduce the bottom-up memory mapping.  That
      is, we will put the two ways of pagetable setup into separate functions
      and choose which one to use in init_mem_mapping, which makes the code
      clearer.
      
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Toshi Kani <toshi.kani@hp.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0167d7d8
    • Jianguo Wu's avatar
      mm/arch: use NUMA_NO_NODE · 40c3baa7
      Jianguo Wu authored
      
      
      Use more appropriate NUMA_NO_NODE instead of -1 in all archs' module_alloc()
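      
      The readability gain of a named constant over a bare -1 can be seen in
      a small sketch.  The DEMO_NUMA_NO_NODE macro and demo_pick_node()
      helper below are hypothetical and exist only for this illustration.

```c
#include <assert.h>

#define DEMO_NUMA_NO_NODE (-1)  /* named "no node preference" value */

/*
 * Resolve a requested node id.  DEMO_NUMA_NO_NODE (or any out-of-range
 * id) means the caller has no usable preference, so the fallback is used.
 */
static int demo_pick_node(int requested, int nr_nodes, int fallback)
{
    assert(nr_nodes > 0 && fallback >= 0 && fallback < nr_nodes);
    if (requested == DEMO_NUMA_NO_NODE)
        return fallback;
    if (requested < 0 || requested >= nr_nodes)
        return fallback;
    return requested;
}
```

      A call such as demo_pick_node(DEMO_NUMA_NO_NODE, 4, 0) states its
      intent at the call site, which a literal -1 does not.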
      
      Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      40c3baa7
    • Len Brown's avatar
      tools / power turbostat: Support Silvermont · 144b44b1
      Len Brown authored
      
      
      Support the next generation Intel Atom processor
      micro-architecture, formerly called Silvermont.
      
      The server version, formerly called "Avoton",
      is named the "Intel(R) Atom(TM) Processor C2000 Product Family".
      
      The client version, formerly called "Bay Trail",
      is named the "Intel Atom Processor Z3000 Series",
      as well as various "Intel Pentium Processor"
      and "Intel Celeron Processor" brands, depending
      on form-factor.
      
      Silvermont has a set of MSRs not far off from NHM,
      but the RAPL register set is a sub-set of those previously supported.
      
      Signed-off-by: Len Brown <len.brown@intel.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      144b44b1