  1. Nov 12, 2017
  2. Nov 10, 2017
    • powerpc/64: Set DSCR default initially from SPR · 1696d0fb
      Nicholas Piggin authored
      
      
      Take the DSCR value set by firmware as the dscr_default value,
      rather than zero.
      
      POWER9 recommends DSCR default to a non-zero value.
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      [mpe: Make record_spr_defaults() __init]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Avoid waiting for secondary hold spinloop with OPAL · 339a3293
      Nicholas Piggin authored
      
      
      OPAL boot does not insert secondaries at 0x60 to wait at the secondary
      hold spinloop. Instead they are started later, and inserted at
      generic_secondary_smp_init(), which is after the secondary hold
      spinloop.
      
      Avoid waiting on this spinloop when booting with OPAL firmware. This
      wait always times out in that case.
      
      This saves 100ms of boot time on powernv, and tens of seconds of real
      time when booting on the simulator in SMP.
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s/radix: Improve TLB flushing for page table freeing · 0b2f5a8a
      Nicholas Piggin authored
      
      
      Unmaps that free page tables always flush the entire PID, which is
      sub-optimal. Provide TLB range flushing with an additional PWC flush
      that can be used for va range invalidations with PWC flush.
      
           Time to munmap N pages of memory including last level page table
           teardown (after mmap, touch), local invalidate:
           N           1       2      4      8     16     32     64
           vanilla  3.2us  3.3us  3.4us  3.6us  4.1us  5.2us  7.2us
           patched  1.4us  1.5us  1.7us  1.9us  2.6us  3.7us  6.2us
      
           Global invalidate:
           N           1       2      4      8     16      32     64
           vanilla  2.2us  2.3us  2.4us  2.6us  3.2us   4.1us  6.2us
           patched  2.1us  2.5us  3.4us  5.2us  8.7us  15.7us  6.2us
      
      Local invalidates get much better across the board. Global ones have
      the same issue where multiple tlbies for a va flush get slower than
      the single tlbie to invalidate the PID. None of these tests capture
      the TLB benefits of avoiding invalidating everything.
      
      Global gets worse, but it is brought into line with global invalidate
      for munmap()s that do not free page tables.
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s/radix: Introduce local single page ceiling for TLB range flush · f6f27951
      Nicholas Piggin authored
      
      
      The single page flush ceiling is the cut-off point at which we switch
      from invalidating individual pages, to invalidating the entire process
      address space in response to a range flush.
      
      Introduce a local variant of this heuristic because local and global
      tlbie have significantly different properties:
      - Local tlbiel requires 128 instructions to invalidate a PID, global
        tlbie only 1 instruction.
      - Global tlbie instructions are expensive broadcast operations.
      
      The local ceiling has been made much higher, 2x the number of
      instructions required to invalidate the entire PID (i.e., 256 pages).
      
           Time to mprotect N pages of memory (after mmap, touch), local invalidate:
           N           32     34      64     128     256     512
           vanilla  7.4us  9.0us  14.6us  26.4us  50.2us  98.3us
           patched  7.4us  7.8us  13.8us  26.4us  51.9us  98.3us
      
      The behaviour of both is identical at N=32 and N=512. Between there,
      the vanilla kernel does a PID invalidate and the patched kernel does
      a va range invalidate.
      
      At N=128, these require the same number of tlbiel instructions, so
      the patched version can be seen to be cheaper when < 128, and more
      expensive when > 128. However, this does not capture the cost of
      the invalidated TLB entries well.
      
      The additional cost at 256 pages does not seem prohibitive. It may
      be the case that increasing the limit further would continue to be
      beneficial to avoid invalidating all of the process's TLB entries.
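
As a rough illustration of the heuristic described above, the following C sketch models how many tlbiel instructions a local range flush would issue. The constant and function names are hypothetical and are not the kernel's. The 128 and 256 figures come from the text, while the strict `>` comparison at the ceiling is an assumption.

```c
#include <stdio.h>

/* Hypothetical names for illustration only. */
#define PID_FLUSH_TLBIEL_COUNT 128UL                           /* tlbiel instructions to flush a whole PID locally */
#define LOCAL_SINGLE_PAGE_CEILING (2 * PID_FLUSH_TLBIEL_COUNT) /* 256 pages */

/*
 * Model of the local range-flush decision: up to the ceiling, issue one
 * tlbiel per page; above it, fall back to invalidating the entire PID.
 * Assumption: the switch happens when nr_pages is strictly greater than
 * the ceiling.
 */
static unsigned long local_range_flush_tlbiel_count(unsigned long nr_pages)
{
    if (nr_pages > LOCAL_SINGLE_PAGE_CEILING)
        return PID_FLUSH_TLBIEL_COUNT;
    return nr_pages;
}

/* Print the modelled instruction count for the page counts in the table. */
static void print_local_flush_model(void)
{
    static const unsigned long n[] = { 32, 34, 64, 128, 256, 512 };
    for (size_t i = 0; i < sizeof(n) / sizeof(n[0]); i++)
        printf("N=%lu -> %lu tlbiel\n", n[i], local_range_flush_tlbiel_count(n[i]));
}
```

Calling `print_local_flush_model()` from a `main()` lists the modelled counts. The model only counts instructions and says nothing about the cost of refilling TLB entries that a full PID flush throws away.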
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s/radix: Optimize flush_tlb_range · cbf09c83
      Nicholas Piggin authored
      
      
      Currently for radix, flush_tlb_range flushes the entire PID, because
      the Linux mm code does not tell us about page size here for THP vs
      regular pages. This is quite sub-optimal for small mremap / mprotect
      / change_protection.
      
      So implement va range flushes with two flush passes, one for each
      page size (regular and THP). The second flush has an order of magnitude
      fewer tlbie instructions than the first, so it is a relatively small
      additional cost.
      
      There is still room for improvement here with some changes to generic
      APIs, particularly if there are mostly THP pages to be invalidated,
      the small page flushes could be reduced.
      
      Time to mprotect 1 page of memory (after mmap, touch):
               local   global
      vanilla  2.9us   1.8us
      patched  1.2us   1.6us
      
      Time to mprotect 30 pages of memory (after mmap, touch):
               local   global
      vanilla  8.2us   7.2us
      patched  6.9us   17.9us
      
      Time to mprotect 34 pages of memory (after mmap, touch):
               local   global
      vanilla  9.1us   8.0us
      patched  9.0us   8.0us
      
      34 pages is the point at which the invalidation switches from va
      to entire PID, which tlbie can do in a single instruction. This is
      why in the case of 30 pages, the new code runs slower for this test.
      This is a deliberate tradeoff already present in the unmap and THP
      promotion code; the idea is that there is a benefit from avoiding
      flushing the entire TLB for this PID on all threads in the system.
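
To make the "order of magnitude fewer" claim concrete, here is a small C sketch that counts how many invalidations each of the two passes would need for a given range. It is only a counting model under stated assumptions: 64K base pages, 2M THP, and a THP pass that covers only huge pages that are aligned and fully inside the range. The function names and shift constants are hypothetical, not taken from the patch.

```c
#include <stdint.h>

#define BASE_PAGE_SHIFT 16u /* assumption: 64K base pages */
#define THP_SHIFT       21u /* assumption: 2M transparent huge pages */

/* Pass 1: one invalidation per base page touching [start, end). */
static uint64_t base_pass_count(uint64_t start, uint64_t end)
{
    uint64_t size = UINT64_C(1) << BASE_PAGE_SHIFT;
    uint64_t first = start & ~(size - 1);           /* round start down */
    uint64_t last = (end + size - 1) & ~(size - 1); /* round end up */
    return (last - first) >> BASE_PAGE_SHIFT;
}

/* Pass 2: one invalidation per huge page fully contained in [start, end). */
static uint64_t thp_pass_count(uint64_t start, uint64_t end)
{
    uint64_t size = UINT64_C(1) << THP_SHIFT;
    uint64_t first = (start + size - 1) & ~(size - 1); /* round start up */
    uint64_t last = end & ~(size - 1);                 /* round end down */
    return last > first ? (last - first) >> THP_SHIFT : 0;
}
```

With these assumed sizes, a 2M-aligned 32M range needs 512 invalidations in the first pass and 16 in the second, a factor of 32. A 30-page range smaller than one huge page needs no second pass at all.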
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s/radix: Implement _tlbie(l)_va_range flush functions · d665767e
      Nicholas Piggin authored
      
      
      Move the barriers and range iteration down into the _tlbie* level,
      which improves readability.
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s/radix: Optimize TLB range flush barriers · 14001c60
      Nicholas Piggin authored
      Short range flushes issue a sequence of tlbie(l) instructions for
      individual effective addresses. These do not all require individual
      barrier sequences, only one covering all tlbie(l) instructions.
      
      Commit f7327e0b ("powerpc/mm/radix: Remove unnecessary ptesync")
      made a similar optimization for tlbiel for PID flushing.
      
      For tlbie, the ISA says:
      
          The tlbsync instruction provides an ordering function for the
          effects of all tlbie instructions executed by the thread executing
          the tlbsync instruction, with respect to the memory barrier
          created by a subsequent ptesync instruction executed by the same
          thread.
      
      Time to munmap 30 pages of memory (after mmap, touch):
               local   global
      vanilla  10.9us  22.3us
      patched   3.4us  14.4us
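
The saving can be pictured by counting barrier instructions. The C sketch below is a model, not kernel code: it assumes the global sequence is `ptesync; tlbie; eieio; tlbsync; ptesync`, and the struct and function names are hypothetical. It compares wrapping every tlbie in its own barrier sequence with wrapping the whole batch once.

```c
/* Hypothetical instruction tally for a global range flush. */
struct insn_count {
    unsigned long tlbie;    /* invalidation instructions issued */
    unsigned long barriers; /* ptesync, eieio and tlbsync instructions issued */
};

/* Before: each page gets its own ptesync; tlbie; eieio; tlbsync; ptesync. */
static struct insn_count flush_with_per_page_barriers(unsigned long nr_pages)
{
    struct insn_count c = { 0, 0 };
    for (unsigned long i = 0; i < nr_pages; i++) {
        c.barriers += 1; /* leading ptesync */
        c.tlbie += 1;
        c.barriers += 3; /* eieio; tlbsync; ptesync */
    }
    return c;
}

/* After: one barrier sequence covers all tlbie instructions in the range. */
static struct insn_count flush_with_batched_barriers(unsigned long nr_pages)
{
    struct insn_count c = { 0, 0 };
    c.barriers += 1; /* leading ptesync */
    for (unsigned long i = 0; i < nr_pages; i++)
        c.tlbie += 1;
    c.barriers += 3; /* eieio; tlbsync; ptesync */
    return c;
}
```

For 30 pages, both variants issue 30 invalidations, but the modelled barrier count drops from 120 to 4. The timings above reflect real hardware cost, which this tally does not attempt to predict.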
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • Merge branch 'fixes' into next · a54c61f4
      Michael Ellerman authored
      We have some dependencies & conflicts between patches in fixes and
      things to go in next, both in the radix TLB flush code and the IMC PMU
      driver. So merge fixes into next.
  3. Nov 09, 2017
  4. Nov 08, 2017
    • powerpc/xmon: Support dumping software pagetables · 80eff6c4
      Balbir Singh authored
      
      
      It would be nice to be able to dump page tables in a particular
      context.
      
      eg: dumping vmalloc space:
      
        0:mon> dv 0xd00037fffff00000
        pgd  @ 0xc0000000017c0000
        pgdp @ 0xc0000000017c00d8 = 0x00000000f10b1000
        pudp @ 0xc0000000f10b13f8 = 0x00000000f10d0000
        pmdp @ 0xc0000000f10d1ff8 = 0x00000000f1102000
        ptep @ 0xc0000000f1102780 = 0xc0000000f1ba018e
        Maps physical address = 0x00000000f1ba0000
        Flags = Accessed Dirty Read Write
      
      This patch does not replicate the complex code of dump_pagetable and
      has no support for the bolted linear mapping, that's why it's called
      dump virtual page table support. The format of the PTE can be expanded
      even further to add more useful information about the flags in the PTE
      if required.
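
The entry addresses in the sample output can be reproduced by splitting the virtual address into per-level indices. The C sketch below does that under an assumed layout: 64K pages, 8-byte entries, and index widths of 8 (PTE), 10 (PMD), 7 (PUD) and 8 (PGD) bits. These widths are inferred from the example output and are not stated in the patch. All names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout, inferred from the sample output; not taken from the patch. */
#define SKETCH_PAGE_SHIFT 16u /* 64K pages */
#define SKETCH_PTE_BITS    8u
#define SKETCH_PMD_BITS   10u
#define SKETCH_PUD_BITS    7u
#define SKETCH_PGD_BITS    8u
#define SKETCH_ENTRY_SIZE  8u /* bytes per table entry */

/* Byte offset of the entry selected by 'bits' index bits starting at 'shift'. */
static uint64_t entry_offset(uint64_t va, unsigned int shift, unsigned int bits)
{
    uint64_t index = (va >> shift) & ((UINT64_C(1) << bits) - 1);
    return index * SKETCH_ENTRY_SIZE;
}

static uint64_t pte_offset_bytes(uint64_t va)
{
    return entry_offset(va, SKETCH_PAGE_SHIFT, SKETCH_PTE_BITS);
}

static uint64_t pmd_offset_bytes(uint64_t va)
{
    return entry_offset(va, SKETCH_PAGE_SHIFT + SKETCH_PTE_BITS, SKETCH_PMD_BITS);
}

static uint64_t pud_offset_bytes(uint64_t va)
{
    return entry_offset(va, SKETCH_PAGE_SHIFT + SKETCH_PTE_BITS + SKETCH_PMD_BITS,
                        SKETCH_PUD_BITS);
}

static uint64_t pgd_offset_bytes(uint64_t va)
{
    return entry_offset(va, SKETCH_PAGE_SHIFT + SKETCH_PTE_BITS + SKETCH_PMD_BITS +
                        SKETCH_PUD_BITS, SKETCH_PGD_BITS);
}
```

For the address 0xd00037fffff00000 used above, these give offsets 0xd8, 0x3f8, 0x1ff8 and 0x780, matching the low bits of the pgdp, pudp, pmdp and ptep addresses in the dump.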
      
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      [mpe: Bike shed the output format, show the pgdir, fix build failures]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  5. Nov 07, 2017
  6. Nov 06, 2017