Skip to content
  1. Nov 18, 2011
  2. Nov 12, 2011
  3. Nov 07, 2011
  4. Nov 03, 2011
    • Andrea Arcangeli's avatar
      thp: share get_huge_page_tail() · b35a35b5
      Andrea Arcangeli authored
      
      
      This avoids duplicating the function in every arch gup_fast.
      
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b35a35b5
    • Andrea Arcangeli's avatar
      mm: thp: tail page refcounting fix · 70b50f94
      Andrea Arcangeli authored
      
      
      Michel while working on the working set estimation code, noticed that
      calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
      wasn't safe, if the pfn ended up being a tail page of a transparent
      hugepage under splitting by __split_huge_page_refcount().
      
      He then found the problem could also theoretically materialize with
      page_cache_get_speculative() during the speculative radix tree lookups
      that uses get_page_unless_zero() in SMP if the radix tree page is freed
      and reallocated and get_user_pages is called on it before
      page_cache_get_speculative has a chance to call get_page_unless_zero().
      
      So the best way to fix the problem is to keep page_tail->_count zero at
      all times.  This will guarantee that get_page_unless_zero() can never
      succeed on any tail page.  page_tail->_mapcount is guaranteed zero and
      is unused for all tail pages of a compound page, so we can simply
      account the tail page references there and transfer them to
      tail_page->_count in __split_huge_page_refcount() (in addition to the
      head_page->_mapcount).
      
      While debugging this s/_count/_mapcount/ change I also noticed get_page is
      called by direct-io.c on pages returned by get_user_pages.  That wasn't
      entirely safe because the two atomic_inc in get_page weren't atomic.  As
      opposed to other get_user_page users like secondary-MMU page fault to
      establish the shadow pagetables would never call any superflous get_page
      after get_user_page returns.  It's safer to make get_page universally safe
      for tail pages and to use get_page_foll() within follow_page (inside
      get_user_pages()).  get_page_foll() is safe to do the refcounting for tail
      pages without taking any locks because it is run within PT lock protected
      critical sections (PT lock for pte and page_table_lock for
      pmd_trans_huge).
      
      The standard get_page() as invoked by direct-io instead will now take
      the compound_lock but still only for tail pages.  The direct-io paths
      are usually I/O bound and the compound_lock is per THP so very
      finegrined, so there's no risk of scalability issues with it.  A simple
      direct-io benchmarks with all lockdep prove locking and spinlock
      debugging infrastructure enabled shows identical performance and no
      overhead.  So it's worth it.  Ideally direct-io should stop calling
      get_page() on pages returned by get_user_pages().  The spinlock in
      get_page() is already optimized away for no-THP builds but doing
      get_page() on tail pages returned by GUP is generally a rare operation
      and usually only run in I/O paths.
      
      This new refcounting on page_tail->_mapcount in addition to avoiding new
      RCU critical sections will also allow the working set estimation code to
      work without any further complexity associated to the tail page
      refcounting with THP.
      
      Signed-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: default avatarMichel Lespinasse <walken@google.com>
      Reviewed-by: default avatarMichel Lespinasse <walken@google.com>
      Reviewed-by: default avatarMinchan Kim <minchan.kim@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: <stable@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      70b50f94
  5. Nov 02, 2011
  6. Nov 01, 2011
    • Borislav Petkov's avatar
      i7core_edac: Drop the edac_mce facility · 4140c542
      Borislav Petkov authored
      
      
      Remove edac_mce pieces and use the normal MCE decoder notifier chain by
      retaining the same functionality with considerably less code.
      
      Signed-off-by: default avatarBorislav Petkov <borislav.petkov@amd.com>
      Signed-off-by: default avatarMauro Carvalho Chehab <mchehab@redhat.com>
      4140c542
    • Christopher Yeoh's avatar
      Cross Memory Attach · fcf63409
      Christopher Yeoh authored
      The basic idea behind cross memory attach is to allow MPI programs doing
      intra-node communication to do a single copy of the message rather than a
      double copy of the message via shared memory.
      
      The following patch attempts to achieve this by allowing a destination
      process, given an address and size from a source process, to copy memory
      directly from the source process into its own address space via a system
      call.  There is also a symmetrical ability to copy from the current
      process's address space into a destination process's address space.
      
      - Use of /proc/pid/mem has been considered, but there are issues with
        using it:
        - Does not allow for specifying iovecs for both src and dest, assuming
          preadv or pwritev was implemented either the area read from or
        written to would need to be contiguous.
        - Currently mem_read allows only processes who are currently
        ptrace'ing the target and are still able to ptrace the target to read
        from the target. This check could possibly be moved to the open call,
        but its not clear exactly what race this restriction is stopping
        (reason  appears to have been lost)
        - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
        domain socket is a bit ugly from a userspace point of view,
        especially when you may have hundreds if not (eventually) thousands
        of processes  that all need to do this with each other
        - Doesn't allow for some future use of the interface we would like to
        consider adding in the future (see below)
        - Interestingly reading from /proc/pid/mem currently actually
        involves two copies! (But this could be fixed pretty easily)
      
      As mentioned previously use of vmsplice instead was considered, but has
      problems.  Since you need the reader and writer working co-operatively if
      the pipe is not drained then you block.  Which requires some wrapping to
      do non blocking on the send side or polling on the receive.  In all to all
      communication it requires ordering otherwise you can deadlock.  And in the
      example of many MPI tasks writing to one MPI task vmsplice serialises the
      copying.
      
      There are some cases of MPI collectives where even a single copy interface
      does not get us the performance gain we could.  For example in an
      MPI_Reduce rather than copy the data from the source we would like to
      instead use it directly in a mathops (say the reduce is doing a sum) as
      this would save us doing a copy.  We don't need to keep a copy of the data
      from the source.  I haven't implemented this, but I think this interface
      could in the future do all this through the use of the flags - eg could
      specify the math operation and type and the kernel rather than just
      copying the data would apply the specified operation between the source
      and destination and store it in the destination.
      
      Although we don't have a "second user" of the interface (though I've had
      some nibbles from people who may be interested in using it for intra
      process messaging which is not MPI).  This interface is something which
      hardware vendors are already doing for their custom drivers to implement
      fast local communication.  And so in addition to this being useful for
      OpenMPI it would mean the driver maintainers don't have to fix things up
      when the mm changes.
      
      There was some discussion about how much faster a true zero copy would
      go. Here's a link back to the email with some testing I did on that:
      
      http://marc.info/?l=linux-mm&m=130105930902915&w=2
      
      There is a basic man page for the proposed interface here:
      
      http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
      
      This has been implemented for x86 and powerpc, other architecture should
      mainly (I think) just need to add syscall numbers for the process_vm_readv
      and process_vm_writev. There are 32 bit compatibility versions for
      64-bit kernels.
      
      For arch maintainers there are some simple tests to be able to quickly
      verify that the syscalls are working correctly here:
      
      http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz
      
      
      
      Signed-off-by: default avatarChris Yeoh <yeohc@au1.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: <linux-man@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      fcf63409
    • Paul Gortmaker's avatar
      lguest: add export.h to lguest files for THIS_MODULE/EXPORT_SYMBOL · 39a0e33d
      Paul Gortmaker authored
      
      
      We need this in advance of the module.h cleanup, or we'll
      get compile errors like this:
      
        CC      drivers/lguest/lguest_device.o
      drivers/lguest/lguest_device.c: In function ‘lguest_devices_init’:
      drivers/lguest/lguest_device.c:490: error: ‘THIS_MODULE’ undeclared (first use in this function)
      
      Signed-off-by: default avatarPaul Gortmaker <paul.gortmaker@windriver.com>
      39a0e33d
    • Paul Gortmaker's avatar
      x86: efi_32.c is implicitly getting asm/desc.h via module.h · 783ac47c
      Paul Gortmaker authored
      
      
      We want to clean up the chain of includes stumbling through
      module.h, and when we do that, we'll see:
      
        CC      arch/x86/platform/efi/efi_32.o
        efi/efi_32.c: In function ‘efi_call_phys_prelog’:
        efi/efi_32.c:80: error: implicit declaration of function ‘get_cpu_gdt_table’
        efi/efi_32.c:82: error: implicit declaration of function ‘load_gdt’
        make[4]: *** [arch/x86/platform/efi/efi_32.o] Error 1
      
      Include asm/desc.h so that there are no implicit include assumptions.
      
      Signed-off-by: default avatarPaul Gortmaker <paul.gortmaker@windriver.com>
      783ac47c
    • Paul Gortmaker's avatar
      x86: fix up files really needing to include module.h · 7c52d551
      Paul Gortmaker authored
      
      
      These files aren't just exporting symbols -- they are also defining
      a MODULE_LICENSE etc. so give them the full module.h file.
      
      Signed-off-by: default avatarPaul Gortmaker <paul.gortmaker@windriver.com>
      7c52d551
    • Paul Gortmaker's avatar
      x86: Fix files explicitly requiring export.h for EXPORT_SYMBOL/THIS_MODULE · 69c60c88
      Paul Gortmaker authored
      
      
      These files were implicitly getting EXPORT_SYMBOL via device.h
      which was including module.h, but that will be fixed up shortly.
      
      By fixing these now, we can avoid seeing things like:
      
      arch/x86/kernel/rtc.c:29: warning: type defaults to ‘int’ in declaration of ‘EXPORT_SYMBOL’
      arch/x86/kernel/pci-dma.c:20: warning: type defaults to ‘int’ in declaration of ‘EXPORT_SYMBOL’
      arch/x86/kernel/e820.c:69: warning: type defaults to ‘int’ in declaration of ‘EXPORT_SYMBOL_GPL’
      
      [ with input from Randy Dunlap <rdunlap@xenotime.net> and also
        from Stephen Rothwell <sfr@canb.auug.org.au> ]
      
      Signed-off-by: default avatarPaul Gortmaker <paul.gortmaker@windriver.com>
      69c60c88