  Oct 22, 2021
    • powerpc/pseries/mobility: ignore ibm,platform-facilities updates · 319fa1a5
      Nathan Lynch authored
      On VMs with NX encryption, compression, and/or RNG offload, these
      capabilities are described by nodes in the ibm,platform-facilities device
      tree hierarchy:
      
        $ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
        /sys/firmware/devicetree/base/ibm,platform-facilities/
        ├── ibm,compression-v1
        ├── ibm,random-v1
        └── ibm,sym-encryption-v1
      
        3 directories
      
      The acceleration functions that these nodes describe are not disrupted by
      live migration, not even temporarily.
      
      But in the post-migration ibm,update-nodes sequence, firmware always
      sends "delete" messages for this hierarchy, followed by an "add"
      directive to reconstruct it via ibm,configure-connector (log with
      debugging statements enabled in mobility.c):
      
        mobility: removing node /ibm,platform-facilities/ibm,random-v1:4294967285
        mobility: removing node /ibm,platform-facilities/ibm,compression-v1:4294967284
        mobility: removing node /ibm,platform-facilities/ibm,sym-encryption-v1:4294967283
        mobility: removing node /ibm,platform-facilities:4294967286
        ...
        mobility: added node /ibm,platform-facilities:4294967286
      
      Note we receive a single "add" message for the entire hierarchy, and what
      we receive from the ibm,configure-connector sequence is the top-level
      platform-facilities node along with its three children. The debug message
      simply reports the parent node and not the whole subtree.
      
      Also, significantly, the nodes added are almost completely equivalent to
      the ones removed; even phandles are unchanged. ibm,shared-interrupt-pool in
      the leaf nodes is the only property I've observed to differ, and Linux does
      not use that. So in practice, the sum of update messages Linux receives for
      this hierarchy is equivalent to minor property updates.
      
      We succeed in removing the original hierarchy from the device tree. But the
      vio bus code is ignorant of this, and does not unbind or relinquish its
      references. The leaf nodes, still reachable through sysfs, of course still
      refer to the now-freed ibm,platform-facilities parent node, which makes
      use-after-free possible:
      
        refcount_t: addition on 0; use-after-free.
        WARNING: CPU: 3 PID: 1706 at lib/refcount.c:25 refcount_warn_saturate+0x164/0x1f0
        refcount_warn_saturate+0x160/0x1f0 (unreliable)
        kobject_get+0xf0/0x100
        of_node_get+0x30/0x50
        of_get_parent+0x50/0xb0
        of_fwnode_get_parent+0x54/0x90
        fwnode_count_parents+0x50/0x150
        fwnode_full_name_string+0x30/0x110
        device_node_string+0x49c/0x790
        vsnprintf+0x1c0/0x4c0
        sprintf+0x44/0x60
        devspec_show+0x34/0x50
        dev_attr_show+0x40/0xa0
        sysfs_kf_seq_show+0xbc/0x200
        kernfs_seq_show+0x44/0x60
        seq_read_iter+0x2a4/0x740
        kernfs_fop_read_iter+0x254/0x2e0
        new_sync_read+0x120/0x190
        vfs_read+0x1d0/0x240
      
      Moreover, the "new" replacement subtree is not correctly added to the
      device tree, resulting in an ibm,platform-facilities parent node without
      the appropriate leaf nodes, and broken symlinks in the sysfs device
      hierarchy:
      
        $ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
        /sys/firmware/devicetree/base/ibm,platform-facilities/
      
        0 directories
      
        $ cd /sys/devices/vio ; find . -xtype l -exec file {} +
        ./ibm,sym-encryption-v1/of_node: broken symbolic link to
          ../../../firmware/devicetree/base/ibm,platform-facilities/ibm,sym-encryption-v1
        ./ibm,random-v1/of_node:         broken symbolic link to
          ../../../firmware/devicetree/base/ibm,platform-facilities/ibm,random-v1
        ./ibm,compression-v1/of_node:    broken symbolic link to
          ../../../firmware/devicetree/base/ibm,platform-facilities/ibm,compression-v1
      
      This is because add_dt_node() -> dlpar_attach_node() attaches only the
      parent node returned from configure-connector, ignoring any children. This
      should be corrected for the general case, but fixing that won't help with
      the stale OF node references, which is the more urgent problem.
      
      One way to address that would be to make the drivers respond to node
      removal notifications, so that node references can be dropped
      appropriately. But this would likely force the drivers to disrupt active
      clients for no useful purpose: equivalent nodes are immediately re-added.
      And recall that the acceleration capabilities described by the nodes remain
      available throughout the whole process.
      
      The solution I believe to be robust for this situation is to convert
      remove+add of a node with an unchanged phandle to an update of the node's
      properties in the Linux device tree structure. That would involve changing
      and adding a fair amount of code, and may take several iterations to land.
      
      Until that can be realized we have a confirmed use-after-free and the
      possibility of memory corruption. So add a limited workaround that
      discriminates on the node type, ignoring adds and removes. This should be
      amenable to backporting in the meantime.
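
      A standalone sketch of what such a node-type filter might look like (the
      helper name and matching rule here are illustrative, not the kernel's
      actual code):

      ```c
      #include <stdbool.h>
      #include <string.h>

      /* Hypothetical helper: report whether a device tree path falls under
       * the ibm,platform-facilities hierarchy, so that "add" and "delete"
       * update messages for it can be ignored. */
      static bool is_platform_facilities_path(const char *path)
      {
              static const char prefix[] = "/ibm,platform-facilities";
              size_t len = sizeof(prefix) - 1;

              /* Match the node itself or any descendant, but not e.g.
               * "/ibm,platform-facilities-foo". */
              if (strncmp(path, prefix, len) != 0)
                      return false;
              return path[len] == '\0' || path[len] == '/';
      }
      ```

      Both the "removing node" and "added node" paths quoted above would match
      such a check, so the hierarchy would simply be left alone across the
      migration.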
      
      Fixes: 410bccf9 ("powerpc/pseries: Partition migration in the kernel")
      Cc: stable@vger.kernel.org
      Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211020194703.2613093-1-nathanl@linux.ibm.com
    • powerpc/32: Don't use a struct based type for pte_t · c7d19189
      Christophe Leroy authored
      Long time ago we had a config item called STRICT_MM_TYPECHECKS
      to build the kernel with pte_t defined as a structure in order
      to perform additional build checks or build it with pte_t
      defined as a simple type in order to get simpler generated code.
      
      Commit 670eea92 ("powerpc/mm: Always use STRICT_MM_TYPECHECKS")
      made the struct based definition the only one, considering that the
      generated code was similar in both cases.
      
      That's right on ppc64 because the ABI is such that the content of a
      struct having a single simple type element is passed as register,
      but on ppc32 such a structure is passed via the stack like any
      structure.
      
      Simple test function:
      
      	pte_t test(pte_t pte)
      	{
      		return pte;
      	}
      
      Before this patch we get
      
      	c00108ec <test>:
      	c00108ec:	81 24 00 00 	lwz     r9,0(r4)
      	c00108f0:	91 23 00 00 	stw     r9,0(r3)
      	c00108f4:	4e 80 00 20 	blr
      
      So, for PPC32, restore the simple type behaviour we had before
      commit 670eea92, but instead of adding a config option to
      activate type checking, do it when __CHECKER__ is set so that type
      checking is performed by 'sparse' and provides feedback like:
      
      	arch/powerpc/mm/pgtable.c:466:16: warning: incorrect type in return expression (different base types)
      	arch/powerpc/mm/pgtable.c:466:16:    expected unsigned long
      	arch/powerpc/mm/pgtable.c:466:16:    got struct pte_t [usertype] x
      
      With this patch we now get
      
      	c0010890 <test>:
      	c0010890:	4e 80 00 20 	blr
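
      The idea can be sketched in a standalone model (simplified; not the
      kernel's exact definitions):

      ```c
      /* Simplified model: under sparse (__CHECKER__), pte_t is a one-member
       * struct so that mixing it up with plain integers is a type error;
       * otherwise it is a bare unsigned long, which the PPC32 ABI can pass
       * and return in a register. */
      #ifdef __CHECKER__
      typedef struct { unsigned long pte; } pte_t;
      #define pte_val(x)  ((x).pte)
      #define __pte(x)    ((pte_t) { (x) })
      #else
      typedef unsigned long pte_t;
      #define pte_val(x)  (x)
      #define __pte(x)    ((pte_t)(x))
      #endif

      /* Same test function as above: a pure pass-through. */
      static pte_t test(pte_t pte)
      {
              return pte;
      }
      ```

      With the plain-type variant, test() compiles down to the single blr
      shown above; with the struct variant sparse flags any implicit
      conversion between pte_t and integers.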
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      [mpe: Define STRICT_MM_TYPECHECKS rather than repeating the condition]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/c904599f33aaf6bb7ee2836a9ff8368509e0d78d.1631887042.git.christophe.leroy@csgroup.eu
    • powerpc/breakpoint: Cleanup · a61ec782
      Christophe Leroy authored
      
      
      cache_op_size() does exactly the same as l1_dcache_bytes().
      
      Remove it.
      
      MSR_64BIT already exists, so there is no need to enclose the check
      in #ifdef __powerpc64__.
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/6184b08088312a7d787d450eb902584e4ae77f7a.1632317816.git.christophe.leroy@csgroup.eu
    • powerpc: Activate CONFIG_STRICT_KERNEL_RWX by default · fdacae8a
      Christophe Leroy authored
      
      
      CONFIG_STRICT_KERNEL_RWX should be set by default on every
      architecture (see https://github.com/KSPP/linux/issues/4).
      
      On PPC32 we have to find a compromise between performance and/or
      memory waste on one hand and selecting STRICT_KERNEL_RWX on the other,
      because it implies either smaller memory chunks or a larger alignment
      between RO memory and RW memory.
      
      For instance, the 8xx maps memory with 8M pages, so either the limit
      between RO and RW must be 8M aligned, or it falls back to 512k pages,
      which implies more pressure on the TLB.
      
      book3s/32 maps memory with BATs as much as possible. BATs can have
      any power-of-two size between 128k and 256M, but we have only 4 to 8
      BATs, so the alignment must be good enough to allow efficient use of
      the BATs and avoid falling back on standard page mapping, which would
      kill performance.
      
      So let's go one step forward and make it the default but still allow
      users to unset it when wanted.
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/057c40164084bfc7d77c0b2ff78d95dbf6a2a21b.1632503622.git.christophe.leroy@csgroup.eu
    • powerpc/8xx: Simplify TLB handling · 63f501e0
      Christophe Leroy authored
      In the old days, TLB handling for 8xx was using tlbie and tlbia
      instructions directly as much as possible.
      
      But commit f048aace ("powerpc/mm: Add SMP support to no-hash
      TLB handling") broke that by introducing unnecessarily complex
      out-of-line functions for booke/smp, which don't have tlbie/tlbia
      instructions and require more complex handling.
      
      Restore direct use of tlbie and tlbia for 8xx which is never SMP.
      
      With this patch we now get
      
      	c00ecc68 <ptep_clear_flush>:
      	c00ecc68:	39 00 00 00 	li      r8,0
      	c00ecc6c:	81 46 00 00 	lwz     r10,0(r6)
      	c00ecc70:	91 06 00 00 	stw     r8,0(r6)
      	c00ecc74:	7c 00 2a 64 	tlbie   r5,r0
      	c00ecc78:	7c 00 04 ac 	hwsync
      	c00ecc7c:	91 43 00 00 	stw     r10,0(r3)
      	c00ecc80:	4e 80 00 20 	blr
      
      Before it was
      
      	c0012880 <local_flush_tlb_page>:
      	c0012880:	2c 03 00 00 	cmpwi   r3,0
      	c0012884:	41 82 00 54 	beq     c00128d8 <local_flush_tlb_page+0x58>
      	c0012888:	81 22 00 00 	lwz     r9,0(r2)
      	c001288c:	81 43 00 20 	lwz     r10,32(r3)
      	c0012890:	39 29 00 01 	addi    r9,r9,1
      	c0012894:	91 22 00 00 	stw     r9,0(r2)
      	c0012898:	2c 0a 00 00 	cmpwi   r10,0
      	c001289c:	41 82 00 10 	beq     c00128ac <local_flush_tlb_page+0x2c>
      	c00128a0:	81 2a 01 dc 	lwz     r9,476(r10)
      	c00128a4:	2c 09 ff ff 	cmpwi   r9,-1
      	c00128a8:	41 82 00 0c 	beq     c00128b4 <local_flush_tlb_page+0x34>
      	c00128ac:	7c 00 22 64 	tlbie   r4,r0
      	c00128b0:	7c 00 04 ac 	hwsync
      	c00128b4:	81 22 00 00 	lwz     r9,0(r2)
      	c00128b8:	39 29 ff ff 	addi    r9,r9,-1
      	c00128bc:	2c 09 00 00 	cmpwi   r9,0
      	c00128c0:	91 22 00 00 	stw     r9,0(r2)
      	c00128c4:	4c a2 00 20 	bclr+   4,eq
      	c00128c8:	81 22 00 70 	lwz     r9,112(r2)
      	c00128cc:	71 29 00 04 	andi.   r9,r9,4
      	c00128d0:	4d 82 00 20 	beqlr
      	c00128d4:	48 65 76 74 	b       c0669f48 <preempt_schedule>
      	c00128d8:	81 22 00 00 	lwz     r9,0(r2)
      	c00128dc:	39 29 00 01 	addi    r9,r9,1
      	c00128e0:	91 22 00 00 	stw     r9,0(r2)
      	c00128e4:	4b ff ff c8 	b       c00128ac <local_flush_tlb_page+0x2c>
      ...
      	c00ecdc8 <ptep_clear_flush>:
      	c00ecdc8:	94 21 ff f0 	stwu    r1,-16(r1)
      	c00ecdcc:	39 20 00 00 	li      r9,0
      	c00ecdd0:	93 c1 00 08 	stw     r30,8(r1)
      	c00ecdd4:	83 c6 00 00 	lwz     r30,0(r6)
      	c00ecdd8:	91 26 00 00 	stw     r9,0(r6)
      	c00ecddc:	93 e1 00 0c 	stw     r31,12(r1)
      	c00ecde0:	7c 08 02 a6 	mflr    r0
      	c00ecde4:	7c 7f 1b 78 	mr      r31,r3
      	c00ecde8:	7c 83 23 78 	mr      r3,r4
      	c00ecdec:	7c a4 2b 78 	mr      r4,r5
      	c00ecdf0:	90 01 00 14 	stw     r0,20(r1)
      	c00ecdf4:	4b f2 5a 8d 	bl      c0012880 <local_flush_tlb_page>
      	c00ecdf8:	93 df 00 00 	stw     r30,0(r31)
      	c00ecdfc:	7f e3 fb 78 	mr      r3,r31
      	c00ece00:	80 01 00 14 	lwz     r0,20(r1)
      	c00ece04:	83 c1 00 08 	lwz     r30,8(r1)
      	c00ece08:	83 e1 00 0c 	lwz     r31,12(r1)
      	c00ece0c:	7c 08 03 a6 	mtlr    r0
      	c00ece10:	38 21 00 10 	addi    r1,r1,16
      	c00ece14:	4e 80 00 20 	blr
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/fb324f1c8f2ddb57cf6aad1cea26329558f1c1c0.1631887021.git.christophe.leroy@csgroup.eu
    • powerpc/lib/sstep: Don't use __{get/put}_user() on kernel addresses · e28d0b67
      Christophe Leroy authored
      
      
      In the old days, when we didn't have kernel userspace access
      protection and had set_fs(), it was wise to use __get_user()
      and friends to read kernel memory.
      
      Nowadays, get_user() and put_user() are granting userspace access and
      are exclusively for userspace access.
      
      Convert single step emulation functions to user_access_begin() and
      friends and use unsafe_get_user() and unsafe_put_user().
      
      When addressing kernel addresses, there is no need to open userspace
      access. And for book3s/32 it is particularly important not to try to
      open userspace access on kernel addresses, because that would break the
      content of the kernel space segment registers. No guard has been put
      against that risk, in order to avoid degrading performance.
      
      copy_from_kernel_nofault() and copy_to_kernel_nofault() should
      be used but they are out-of-line functions which would degrade
      performance. Those two functions are making use of
      __get_kernel_nofault() and __put_kernel_nofault() macros.
      Those two macros are just wrappers behind __get_user_size_goto() and
      __put_user_size_goto().
      
      unsafe_get_user() and unsafe_put_user() are also wrappers of
      __get_user_size_goto() and __put_user_size_goto(). Use them to
      access kernel space. That allows refactoring userspace and
      kernelspace access.
      
      Reported-by: Stan Johnson <userm57@yahoo.com>
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Depends-on: 4fe5cda9 ("powerpc/uaccess: Implement user_read_access_begin and user_write_access_begin")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/22831c9d17f948680a12c5292e7627288b15f713.1631817805.git.christophe.leroy@csgroup.eu
    • powerpc: warn on emulation of dcbz instruction in kernel mode · cbe654c7
      Christophe Leroy authored
      
      
      The dcbz instruction shouldn't be used on non-cached memory. Using
      it on non-cached memory results in an alignment exception and
      implies heavy handling.
      
      Instead of silently emulating the instruction, resulting in high
      performance degradation, warn whenever an alignment exception is
      taken in kernel mode due to dcbz, so that the user is made aware that
      the dcbz instruction has been used unexpectedly by the kernel.
      
      Reported-by: Stan Johnson <userm57@yahoo.com>
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/2e3acfe63d289c6fba366e16973c9ab8369e8b75.1631803922.git.christophe.leroy@csgroup.eu
    • powerpc/32: Add support for out-of-line static calls · 5c810ced
      Christophe Leroy authored
      
      
      Add support for out-of-line static calls on PPC32. This change
      improves the performance of calls to global function pointers by
      using direct calls instead of indirect calls.
      
      The trampoline is initially populated with a 'blr' or a branch to the
      target, followed by an unreachable long jump sequence.
      
      In order to cope with parallel execution, the trampoline needs to
      be updated in a way that ensures it remains consistent at all times.
      This means we can't use the traditional lis/addi to load r12 with
      the target address: there would otherwise be a window during which
      the first instruction contains the upper part of the new target
      address while the second instruction still contains the lower part of
      the old target address. To avoid that, the target address is stored
      just after the 'bctr' and loaded from there with a single instruction.
      
      Then, depending on the target distance, arch_static_call_transform()
      will either replace the first instruction by a direct 'bl <target>' or
      'nop' in order to have the trampoline fall through the long jump
      sequence.
      
      For the special case of __static_call_return0(), to avoid the risk of
      a far branch, a version of it is inlined at the end of the trampoline.
      
      Performance-wise the long jump sequence is probably not better than
      the indirect calls set by GCC when we don't use static calls, but
      such calls are unlikely to be required on powerpc32: with most
      configurations the kernel size is far below 32 Mbytes, so only
      modules may happen to be too far away. And even modules are likely to
      be close enough, as they are allocated below the kernel core and
      as close as possible to the kernel text.
      
      static_call selftest is running successfully with this change.
      
      With this patch, __do_irq() has the following sequence to trace
      irq entries:
      
      	c0004a00 <__SCT__tp_func_irq_entry>:
      	c0004a00:	48 00 00 e0 	b       c0004ae0 <__traceiter_irq_entry>
      	c0004a04:	3d 80 c0 00 	lis     r12,-16384
      	c0004a08:	81 8c 4a 1c 	lwz     r12,18972(r12)
      	c0004a0c:	7d 89 03 a6 	mtctr   r12
      	c0004a10:	4e 80 04 20 	bctr
      	c0004a14:	38 60 00 00 	li      r3,0
      	c0004a18:	4e 80 00 20 	blr
      	c0004a1c:	00 00 00 00 	.long 0x0
      ...
      	c0005654 <__do_irq>:
      ...
      	c0005664:	7c 7f 1b 78 	mr      r31,r3
      ...
      	c00056a0:	81 22 00 00 	lwz     r9,0(r2)
      	c00056a4:	39 29 00 01 	addi    r9,r9,1
      	c00056a8:	91 22 00 00 	stw     r9,0(r2)
      	c00056ac:	3d 20 c0 af 	lis     r9,-16209
      	c00056b0:	81 29 74 cc 	lwz     r9,29900(r9)
      	c00056b4:	2c 09 00 00 	cmpwi   r9,0
      	c00056b8:	41 82 00 10 	beq     c00056c8 <__do_irq+0x74>
      	c00056bc:	80 69 00 04 	lwz     r3,4(r9)
      	c00056c0:	7f e4 fb 78 	mr      r4,r31
      	c00056c4:	4b ff f3 3d 	bl      c0004a00 <__SCT__tp_func_irq_entry>
      
      Before this patch, __do_irq() was doing the following to trace irq
      entries:
      
      	c0005700 <__do_irq>:
      ...
      	c0005710:	7c 7e 1b 78 	mr      r30,r3
      ...
      	c000574c:	93 e1 00 0c 	stw     r31,12(r1)
      	c0005750:	81 22 00 00 	lwz     r9,0(r2)
      	c0005754:	39 29 00 01 	addi    r9,r9,1
      	c0005758:	91 22 00 00 	stw     r9,0(r2)
      	c000575c:	3d 20 c0 af 	lis     r9,-16209
      	c0005760:	83 e9 f4 cc 	lwz     r31,-2868(r9)
      	c0005764:	2c 1f 00 00 	cmpwi   r31,0
      	c0005768:	41 82 00 24 	beq     c000578c <__do_irq+0x8c>
      	c000576c:	81 3f 00 00 	lwz     r9,0(r31)
      	c0005770:	80 7f 00 04 	lwz     r3,4(r31)
      	c0005774:	7d 29 03 a6 	mtctr   r9
      	c0005778:	7f c4 f3 78 	mr      r4,r30
      	c000577c:	4e 80 04 21 	bctrl
      	c0005780:	85 3f 00 0c 	lwzu    r9,12(r31)
      	c0005784:	2c 09 00 00 	cmpwi   r9,0
      	c0005788:	40 82 ff e4 	bne     c000576c <__do_irq+0x6c>
      
      Beyond now using a direct 'bl' instead of a 'load/mtctr/bctr'
      sequence, we can also see that we get one less register on the stack.
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/6ec2a7865ed6a5ec54ab46d026785bafe1d837ea.1630484892.git.christophe.leroy@csgroup.eu
    • powerpc/machdep: Remove stale functions from ppc_md structure · 8f7fadb4
      Christophe Leroy authored
      ppc_md.iommu_save() is not set anymore by any platform after
      commit c40785ad ("powerpc/dart: Use a cachable DART").
      So iommu_save() has become a nop and can be removed.
      
      ppc_md.show_percpuinfo() is not set anymore by any platform after
      commit 4350147a ("[PATCH] ppc64: SMU based macs cpufreq support").
      
      The last users of ppc_md.rtc_read_val() and ppc_md.rtc_write_val() were
      removed by commit 0f03a43b ("[POWERPC] Remove todc code from
      ARCH=powerpc").
      
      Last user of kgdb_map_scc() was removed by commit 17ce452f ("kgdb,
      powerpc: arch specific powerpc kgdb support").
      
      ppc_md.machine_kexec_prepare() has not been used since
      commit 8ee3e0d6 ("powerpc: Remove the main legacy iSerie platform
      code"). This allows the removal of machine_kexec_prepare() and the
      rename of default_machine_kexec_prepare() into machine_kexec_prepare().
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Daniel Axtens <dja@axtens.net>
      [mpe: Drop prototype for default_machine_kexec_prepare() as noted by dja]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/24d4ca0ada683c9436a5f812a7aeb0a1362afa2b.1630398606.git.christophe.leroy@csgroup.eu
    • powerpc/time: Remove generic_suspend_{dis/en}able_irqs() · e606a2f4
      Christophe Leroy authored
      Commit d75d68cf ("powerpc: Clean up obsolete code relating to
      decrementer and timebase") made generic_suspend_enable_irqs() and
      generic_suspend_disable_irqs() static.
      
      Fold them into their only caller.
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/c3f9ec9950394ef939014f7934268e6ee30ca04f.1630398566.git.christophe.leroy@csgroup.eu
    • powerpc/audit: Convert powerpc to AUDIT_ARCH_COMPAT_GENERIC · 566af8cd
      Christophe Leroy authored
      Commit e65e1fc2 ("[PATCH] syscall class hookup for all normal
      targets") added generic support for AUDIT but that didn't include
      support for bi-arch like powerpc.
      
      Commit 4b588411 ("audit: Add generic compat syscall support")
      added generic support for bi-arch.
      
      Convert powerpc to that bi-arch generic audit support.
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/a4b3951d1191d4183d92a07a6097566bde60d00a.1629812058.git.christophe.leroy@csgroup.eu
    • powerpc/32: Don't use lmw/stmw for saving/restoring non volatile regs · a85c728c
      Christophe Leroy authored
      
      
      Instructions lmw/stmw are interesting for functions that are rarely
      used and not in the cache, because only one instruction has to be
      copied into the instruction cache instead of 19. However those
      instructions are less performant than 19 individual lwz/stw, as they
      require synchronisation plus one additional cycle.
      
      SAVE_NVGPRS / REST_NVGPRS are used in only a few places which are
      mostly in interrupts entries/exits and in task switch so they are
      likely already in the cache.
      
      Using standard lwz improves null_syscall selftest by:
      - 10 cycles on mpc832x.
      - 2 cycles on mpc8xx.
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/316c543b8906712c108985c8463eec09c8db577b.1629732542.git.christophe.leroy@csgroup.eu
    • powerpc/5200: dts: fix memory node unit name · aed2886a
      Anatolij Gustschin authored
      
      
      Fixes build warnings:
      Warning (unit_address_vs_reg): /memory: node has a reg or ranges property, but no unit name
      
      Signed-off-by: Anatolij Gustschin <agust@denx.de>
      Reviewed-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211013220532.24759-4-agust@denx.de
    • powerpc/5200: dts: fix pci ranges warnings · 7855b6c6
      Anatolij Gustschin authored
      
      
      Fix ranges property warnings:
      pci@f0000d00:ranges: 'oneOf' conditional failed, one must be fixed:
      
      Signed-off-by: Anatolij Gustschin <agust@denx.de>
      Reviewed-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211013220532.24759-3-agust@denx.de
    • powerpc/5200: dts: add missing pci ranges · e9efabc6
      Anatolij Gustschin authored
      
      
      Add ranges property to fix build warnings:
      Warning (pci_bridge): /pci@f0000d00: missing ranges for PCI bridge (or not a bridge)
      
      Signed-off-by: Anatolij Gustschin <agust@denx.de>
      Reviewed-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211013220532.24759-2-agust@denx.de
    • powerpc/vas: Fix potential NULL pointer dereference · 61cb9ac6
      Gustavo A. R. Silva authored
      (!ptr && !ptr->foo) strikes again. :)
      
      The expression (!ptr && !ptr->foo) is bogus and, in case ptr is NULL,
      leads to a NULL pointer dereference: ptr->foo.
      
      Fix this by converting && to ||.
      
      This issue was detected with the help of Coccinelle, and audited and
      fixed manually.
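      
      The shape of the bug is easy to reproduce standalone (struct and field
      names below are illustrative, not the VAS code):
      
      ```c
      #include <stdbool.h>
      #include <stddef.h>

      struct window {
              void *ops;      /* illustrative field standing in for "foo" */
      };

      /* Buggy form: (!w && !w->ops). When w is NULL the left operand is
       * true, so && evaluates the right operand and dereferences NULL.
       * The fixed form below short-circuits correctly: a NULL w makes the
       * whole expression true without touching w->ops. */
      static bool window_invalid(const struct window *w)
      {
              return !w || !w->ops;   /* || instead of && */
      }
      ```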
      
      Fixes: 1a0d0d5e ("powerpc/vas: Add platform specific user window operations")
      Cc: stable@vger.kernel.org
      Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
      Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211015050345.GA1161918@embeddedor
    • powerpc/fsl_booke: Enable STRICT_KERNEL_RWX · 49e3d8ea
      Christophe Leroy authored
      
      
      Enable STRICT_KERNEL_RWX on fsl_booke.
      
      For that, we need additional TLBCAMs dedicated to linear mapping,
      based on the alignment of _sinittext.
      
      By default, up to 768 Mbytes of memory are mapped.
      It uses 3 TLBCAMs of size 256 Mbytes.
      
      With a data alignment of 16, we need up to 9 TLBCAMs:
        16/16/16/16/64/64/64/256/256
      
      With a data alignment of 4, we need up to 12 TLBCAMs:
        4/4/4/4/16/16/16/64/64/64/256/256
      
      With a data alignment of 1, we need up to 15 TLBCAMs:
        1/1/1/1/4/4/4/16/16/16/64/64/64/256/256
      
      By default, set a 16 Mbytes alignment as a compromise between memory
      usage and number of TLBCAMs. This can be adjusted manually when needed.
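      
      The worst-case counts above can be reproduced with a small model (an
      illustration of the size-alignment constraint, not the kernel's
      allocator): TLBCAM sizes here come in powers of four (1/4/16/64/256
      Mbytes), each entry must be aligned to its own size, and in the worst
      case the RO/RW boundary sits at an odd multiple of the chosen alignment.
      
      ```c
      /* Count the TLBCAM entries needed to map [addr, end), in Mbytes,
       * when each entry's size must divide its start address. Greedy:
       * always take the largest size that both fits and is aligned. */
      static int cams_for_range(unsigned int addr, unsigned int end)
      {
              static const unsigned int sizes[] = { 256, 64, 16, 4, 1 };
              int count = 0;

              while (addr < end) {
                      for (unsigned int i = 0;
                           i < sizeof(sizes) / sizeof(sizes[0]); i++) {
                              if (sizes[i] <= end - addr &&
                                  addr % sizes[i] == 0) {
                                      addr += sizes[i];
                                      count++;
                                      break;
                              }
                      }
              }
              return count;
      }

      /* Worst case for a given alignment: the RO/RW boundary sits at
       * `align` Mbytes; count entries below and above it, up to the 768M
       * mapped by default. */
      static int worst_case_cams(unsigned int align)
      {
              return cams_for_range(0, align) + cams_for_range(align, 768);
      }
      ```
      
      This model yields 9, 12 and 15 entries for alignments of 16, 4 and 1,
      matching the breakdowns listed above.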
      
      For the time being, it doesn't work when the base is randomised.
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/29f9e5d2bbbc83ae9ca879265426a6278bf4d5bb.1634292136.git.christophe.leroy@csgroup.eu
    • powerpc/fsl_booke: Update of TLBCAMs after init · d5970045
      Christophe Leroy authored
      
      
      After init, set readonly memory as ROX and set readwrite
      memory as RWX, if STRICT_KERNEL_RWX is enabled.
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/66bef0b9c273e1121706883f3cf5ad0a053d863f.1634292136.git.christophe.leroy@csgroup.eu
    • powerpc/fsl_booke: Allocate separate TLBCAMs for readonly memory · 0b2859a7
      Christophe Leroy authored
      
      
      Reorganise TLBCAM allocation so that when STRICT_KERNEL_RWX is
      enabled, TLBCAMs are allocated such that readonly memory uses
      different TLBCAMs.
      
      This results in an allocation looking like:
      
      Memory CAM mapping: 4/4/4/1/1/1/1/16/16/16/64/64/64/256/256 Mb, residual: 256Mb
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/8ca169bc288261a0e0558712f979023c3a960ebb.1634292136.git.christophe.leroy@csgroup.eu
    • powerpc/fsl_booke: Tell map_mem_in_cams() if init is done · 52bda69a
      Christophe Leroy authored
      
      
      In order to be able to call map_mem_in_cams() once more
      after init for STRICT_KERNEL_RWX, add an argument.
      
      For now, map_mem_in_cams() is always called only during init.
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/3b69a7e0b393b16984ade882a5eae5d727717459.1634292136.git.christophe.leroy@csgroup.eu
    • powerpc/fsl_booke: Enable reloading of TLBCAM without switching to AS1 · a97dd9e2
      Christophe Leroy authored
      
      
      Avoid switching to AS1 when reloading TLBCAM after init for
      STRICT_KERNEL_RWX.
      
      When we set up AS1 we expect the entire accessible memory to be mapped
      through one entry; this is not the case anymore at the end of init.
      
      We are not changing the size of TLBCAMs, only flags, so no need to
      switch to AS1.
      
      So change loadcam_multi() to not switch to AS1 when the given
      temporary tlb entry in 0.
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/a9d517fbfbc940f56103c46b323f6eb8f4485571.1634292136.git.christophe.leroy@csgroup.eu
      a97dd9e2
    • Christophe Leroy's avatar
      powerpc/fsl_booke: Take exec flag into account when setting TLBCAMs · 01116e6e
      Christophe Leroy authored
      
      
      Don't force MAS3_SX and MAS3_UX at all times. Take the exec flag
      into account instead.
      
      While at it, fix a couple of nearby style problems (indentation with
      spaces and unnecessary parentheses) to keep the code readable.
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/5467044e59f27f9fcf709b9661779e3ce5f784f6.1634292136.git.christophe.leroy@csgroup.eu
      01116e6e
    • Christophe Leroy's avatar
      powerpc/fsl_booke: Rename fsl_booke.c to fsl_book3e.c · 3a75fd70
      Christophe Leroy authored
      
      
      We have a myriad of CONFIG symbols around different variants
      of BOOKEs, which would be worth tidying up one day.
      
      But at least make file names and CONFIG options match:
      
      We have CONFIG_FSL_BOOKE and CONFIG_PPC_FSL_BOOK3E.
      
      fsl_booke.c is selected by and only by CONFIG_PPC_FSL_BOOK3E.
      So rename it to fsl_book3e.c to reduce confusion.
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/5dc871db1f67739319bec11f049ca450da1c13a2.1634292136.git.christophe.leroy@csgroup.eu
      3a75fd70
    • Christophe Leroy's avatar
      powerpc/booke: Disable STRICT_KERNEL_RWX, DEBUG_PAGEALLOC and KFENCE · 68b44f94
      Christophe Leroy authored
      fsl_booke and 44x are not able to map kernel linear memory with
      pages, so they can't support DEBUG_PAGEALLOC and KFENCE, and
      STRICT_KERNEL_RWX is also a problem for now.
      
      Enable those only on book3s (both 32 and 64 except KFENCE), 8xx and 40x.
      
      Fixes: 88df6e90 ("[POWERPC] DEBUG_PAGEALLOC for 32-bit")
      Fixes: 95902e6c ("powerpc/mm: Implement STRICT_KERNEL_RWX on PPC32")
      Fixes: 90cbac0e ("powerpc: Enable KFENCE for PPC32")
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/d1ad9fdd9b27da3fdfa16510bb542ed51fa6e134.1634292136.git.christophe.leroy@csgroup.eu
      68b44f94
    • Wan Jiabing's avatar
      powerpc/kexec_file: Add of_node_put() before goto · 7453f501
      Wan Jiabing authored
      
      
      Fix the following coccicheck warning:
      ./arch/powerpc/kexec/file_load_64.c:698:1-22: WARNING: Function
      for_each_node_by_type should have of_node_put() before goto
      
      Early exits from for_each_node_by_type should decrement the
      node reference counter.
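
A minimal, self-contained mock of the pattern (struct layout and helper names here are illustrative stand-ins, not the kernel's real OF API):

```c
/* The kernel's for_each_node_by_type() holds a reference on the node it
 * yields each iteration; leaving the loop early via goto or break
 * without an of_node_put() leaks that reference. Mocked below. */

struct device_node { int refcount; };

static struct device_node *of_node_get_mock(struct device_node *dn)
{
	dn->refcount++;
	return dn;
}

static void of_node_put_mock(struct device_node *dn)
{
	dn->refcount--;
}

/* Walk n nodes, bailing out at index 'stop' (or never, if stop < 0).
 * Returns 0 if every reference taken was dropped, -1 if one leaked. */
static int walk_nodes(struct device_node *nodes, int n, int stop)
{
	int i;

	for (i = 0; i < n; i++) {
		struct device_node *dn = of_node_get_mock(&nodes[i]);

		if (i == stop) {
			of_node_put_mock(dn);	/* the fix: balance before the early exit */
			break;
		}
		of_node_put_mock(dn);	/* normal path: drop before the next get */
	}

	for (i = 0; i < n; i++)
		if (nodes[i].refcount != 0)
			return -1;	/* a reference leaked */
	return 0;
}
```

With the of_node_put_mock() before the break removed, any early exit leaves a node with a dangling reference, which is exactly what the coccicheck rule flags.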
      
      Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
      Acked-by: Hari Bathini <hbathini@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211018015418.10182-1-wanjiabing@vivo.com
      7453f501
    • Wan Jiabing's avatar
      powerpc/pseries/iommu: Add of_node_put() before break · 915b368f
      Wan Jiabing authored
      
      
      Fix the following coccicheck warning:
      
      ./arch/powerpc/platforms/pseries/iommu.c:924:1-28: WARNING: Function
      for_each_node_with_property should have of_node_put() before break
      
      Early exits from for_each_node_with_property should decrement the
      node reference counter.
      
      Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
      Reviewed-by: Leonardo Bras <leobras.c@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211014075624.16344-1-wanjiabing@vivo.com
      915b368f
    • Joel Stanley's avatar
      powerpc/s64: Clarify that radix lacks DEBUG_PAGEALLOC · 4f703e7f
      Joel Stanley authored
      
      
      The page_alloc.c code will call into __kernel_map_pages() when
      DEBUG_PAGEALLOC is configured and enabled.
      
      As the implementation assumes hash, this would crash spectacularly if
      not for a bit of luck in __kernel_map_pages(): there,
      linear_map_hash_count is always zero, so the for loop exits without
      doing any damage.
      
      No other platform determines at runtime whether it supports
      debug_pagealloc. Instead of adding code to mm/page_alloc.c to do that,
      this change turns the map/unmap into a no-op when in radix mode and
      prints a warning once.
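
A hedged sketch of the shape of that change (all names below are simplified stand-ins, not the kernel's real symbols):

```c
/* Instead of relying on linear_map_hash_count happening to be zero, the
 * map/unmap path becomes an explicit warn-once no-op under radix. */

static int radix_mode = 1;	/* stand-in for radix_enabled() */
static int warned;		/* stand-in for pr_warn_once() state */

/* Returns the number of pages actually toggled in the linear map. */
static int kernel_map_pages_sketch(int numpages, int enable)
{
	if (radix_mode) {
		if (!warned)
			warned = 1;	/* "DEBUG_PAGEALLOC not supported on radix" */
		return 0;		/* no-op: nothing is (un)mapped */
	}
	(void)enable;
	return numpages;	/* hash path would process every page */
}
```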
      
      Signed-off-by: Joel Stanley <joel@jms.id.au>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      [mpe: Reformat if per Christophe's suggestion]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211013213438.675095-1-joel@jms.id.au
      4f703e7f
  2. Oct 14, 2021
    • Christophe Leroy's avatar
      powerpc: Mark .opd section read-only · 3091f5fc
      Christophe Leroy authored
      
      
      The .opd section contains function descriptors used to locate
      functions in the kernel. If someone is able to modify a function
      descriptor, they can redirect a call to an arbitrary kernel function
      in place of the intended one.
      
      To avoid that, move the .opd section into read-only memory.
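
For illustration, the layout of an ELFv1 (big-endian ppc64) function descriptor, the kind of entry .opd holds (field names here are mine, not the kernel's):

```c
/* A call through a descriptor fetches the code address and TOC pointer
 * from it, so a writable descriptor lets an attacker redirect the call
 * by overwriting a single pointer. */
struct opd_entry_sketch {
	unsigned long addr;	/* entry point of the function's code */
	unsigned long toc;	/* TOC base (r2) the function expects */
	unsigned long env;	/* environment pointer, unused by C */
};
```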
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/3cd40b682fb6f75bb40947b55ca0bac20cb3f995.1634136222.git.christophe.leroy@csgroup.eu
      3091f5fc
    • Athira Rajeev's avatar
      powerpc/perf: Fix cycles/instructions as PM_CYC/PM_INST_CMPL in power10 · 8f6aca0e
      Athira Rajeev authored
      On power9 and earlier platforms, the default event used for cycles and
      instructions is PM_CYC (0x0001e) and PM_INST_CMPL (0x00002)
      respectively. These events use two programmable PMCs and by default will
      count irrespective of the run latch state (idle state). But since they
      use programmable PMCs, these events can lead to multiplexing with other
      events, because there are only 4 programmable PMCs. Hence in power10,
      performance monitoring unit (PMU) driver uses performance monitor
      counter 5 (PMC5) and performance monitor counter 6 (PMC6) for counting
      instructions and cycles.
      
      Currently on power10, the event used for cycles is PM_RUN_CYC (0x600F4)
      and instructions uses PM_RUN_INST_CMPL (0x500fa). But counting of these
      events in idle state is controlled by the CC56RUN bit setting in Monitor
      Mode Control Register0 (MMCR0). If the CC56RUN bit is zero, PMC5/6 will
      not count when CTRL[RUN] (run latch) is zero. This could lead to missing
      some counts if a thread is in idle state during system wide profiling.
      
      To fix it, set the CC56RUN bit in MMCR0 for power10, which makes PMC5
      and PMC6 count instructions and cycles regardless of the run latch
      state. Since this change makes PMC5/6 count as PM_INST_CMPL/PM_CYC,
      rename the event code 0x600f4 as PM_CYC instead of PM_RUN_CYC and event
      code 0x500fa as PM_INST_CMPL instead of PM_RUN_INST_CMPL. The changes
      are only for the PMC5/6 event codes and will not affect the behaviour of
      PM_RUN_CYC/PM_RUN_INST_CMPL if programmed in other PMCs.
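
A hedged sketch of the MMCR0 tweak described above; the bit value follows my reading of the MMCR0 layout, and the macro name is illustrative (the kernel's actual name and value may differ):

```c
/* Setting CC56RUN makes PMC5/6 count regardless of CTRL[RUN] (the run
 * latch), so idle time is no longer excluded from cycles/instructions. */
#define MMCR0_CC56RUN_SKETCH 0x00000100UL	/* assumed bit position */

static unsigned long mmcr0_enable_cc56run(unsigned long mmcr0)
{
	return mmcr0 | MMCR0_CC56RUN_SKETCH;
}
```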
      
      Fixes: a64e697c ("powerpc/perf: power10 Performance Monitoring support")
      Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
      Reviewed-by: Madhavan Srinivasan <maddy@linux.ibm.com>
      [mpe: Tweak change log wording for style and consistency]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211007075121.28497-1-atrajeev@linux.vnet.ibm.com
      8f6aca0e
  3. Oct 13, 2021
    • Kai Song's avatar
      powerpc/eeh: Fix docstrings in eeh.c · b616230e
      Kai Song authored
      
      
      We fix the following warnings when building the kernel with W=1:
      arch/powerpc/kernel/eeh.c:598: warning: Function parameter or member 'function' not described in 'eeh_pci_enable'
      arch/powerpc/kernel/eeh.c:774: warning: Function parameter or member 'edev' not described in 'eeh_set_dev_freset'
      arch/powerpc/kernel/eeh.c:774: warning: expecting prototype for eeh_set_pe_freset(). Prototype was for eeh_set_dev_freset() instead
      arch/powerpc/kernel/eeh.c:814: warning: Function parameter or member 'include_passed' not described in 'eeh_pe_reset_full'
      arch/powerpc/kernel/eeh.c:944: warning: Function parameter or member 'ops' not described in 'eeh_init'
      arch/powerpc/kernel/eeh.c:1451: warning: Function parameter or member 'include_passed' not described in 'eeh_pe_reset'
      arch/powerpc/kernel/eeh.c:1526: warning: Function parameter or member 'func' not described in 'eeh_pe_inject_err'
      arch/powerpc/kernel/eeh.c:1526: warning: Excess function parameter 'function' described in 'eeh_pe_inject_err'
      
      Signed-off-by: Kai Song <songkai01@inspur.com>
      Reviewed-by: Daniel Axtens <dja@axtens.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211009041630.4135-1-songkai01@inspur.com
      b616230e
    • Cédric Le Goater's avatar
      powerpc/boot: Use CONFIG_PPC_POWERNV to compile OPAL support · 6ffeb56e
      Cédric Le Goater authored
      
      
      CONFIG_PPC64_BOOT_WRAPPER is selected by CPU_LITTLE_ENDIAN, which is
      used to compile support for other platforms such as Microwatt. There
      is no need for OPAL calls on those platforms.
      
      Signed-off-by: Cédric Le Goater <clg@kaod.org>
      Reviewed-by: Joel Stanley <joel@jms.id.au>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20211011070356.99952-1-clg@kaod.org
      6ffeb56e
    • Christophe Leroy's avatar
      powerpc: Set max_mapnr correctly · 602946ec
      Christophe Leroy authored
      
      
      max_mapnr is used by virt_addr_valid() to check if a linear
      address is valid.
      
      It must only include lowmem PFNs, like other architectures.
      
      The problem was detected on a system with 1G of memory (only 768M
      mapped): with CONFIG_DEBUG_VIRTUAL and CONFIG_TEST_DEBUG_VIRTUAL
      enabled, it did not report virt_to_phys(VMALLOC_START), VMALLOC_START
      being 0xf1000000.
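
A simplified model of why that matters (not the kernel's exact macros; PAGE_SHIFT and the address-to-PFN arithmetic are the standard ones, everything else is a stand-in): virt_addr_valid() reduces to a PFN check against max_mapnr, so a max_mapnr derived from total memory rather than mapped lowmem wrongly validates vmalloc-space addresses.

```c
/* Model of the PFN check behind virt_addr_valid() on 32-bit powerpc. */
#define PAGE_SHIFT_SKETCH 12

static unsigned long max_mapnr_sketch;

static int virt_addr_valid_sketch(unsigned long vaddr, unsigned long page_offset)
{
	unsigned long pfn = (vaddr - page_offset) >> PAGE_SHIFT_SKETCH;

	/* valid only for linear-map addresses whose PFN is below max_mapnr */
	return vaddr >= page_offset && pfn < max_mapnr_sketch;
}
```

With 768M of lowmem mapped at 0xc0000000, a lowmem-only max_mapnr (0x30000 PFNs) correctly rejects 0xf1000000, while a total-memory max_mapnr (0x40000 PFNs for 1G) wrongly accepts it.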
      
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/77d99037782ac4b3c3b0124fc4ae80ce7b760b05.1634035228.git.christophe.leroy@csgroup.eu
      602946ec
  4. Oct 12, 2021
  5. Oct 08, 2021
    • Nathan Lynch's avatar
      powerpc/pseries/cpuhp: remove obsolete comment from pseries_cpu_die · f9473a65
      Nathan Lynch authored
      
      
      This comment likely refers to the obsolete DLPAR workflow where some
      resource state transitions were driven more directly from user space
      utilities, but it also seems to contradict itself: "Change isolate state to
      Isolate [...]" is at odds with the preceding sentences, and it does not
      relate at all to the code that follows.
      
      Remove it to prevent confusion.
      
      Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
      Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210927201933.76786-5-nathanl@linux.ibm.com
      f9473a65
    • Nathan Lynch's avatar
      powerpc/pseries/cpuhp: delete add/remove_by_count code · fa2a5dfe
      Nathan Lynch authored
      The core DLPAR code supports two actions (add and remove) and three
      subtypes of action:
      
      * By DRC index: the action is attempted on a single specified resource.
        This is the usual case for processors.
      * By indexed count: the action is attempted on a range of resources
        beginning at the specified index. This is implemented only by the memory
        DLPAR code.
      * By count: the lower layer (CPU or memory) is responsible for locating the
        specified number of resources to which the action can be applied.
      
      I cannot find any evidence of the "by count" subtype being used by drmgr or
      qemu for processors. And when I try to exercise this code, the add case
      does not work:
      
        $ ppc64_cpu --smt ; nproc
        SMT=8
        24
        $ printf "cpu remove count 2" > /sys/kernel/dlpar
        $ nproc
        8
        $ printf "cpu add count 2" > /sys/kernel/dlpar
        -bash: printf: write error: Invalid argument
        $ dmesg | tail -2
        pseries-hotplug-cpu: Failed to find enough CPUs (1 of 2) to add
        dlpar: Could not handle DLPAR request "cpu add count 2"
        $ nproc
        8
        $ drmgr -c cpu -a -q 2         # this uses the by-index method
        Validating CPU DLPAR capability...yes.
        CPU 1
        CPU 17
        $ nproc
        24
      
      This is because find_drc_info_cpus_to_add() does not increment drc_index
      appropriately during its search.
      
      This is not hard to fix. But the _by_count() functions also have the
      property that they attempt to roll back all prior operations if the entire
      request cannot be satisfied, even though the rollback itself can encounter
      errors. It's not possible to provide transaction-like behavior at this
      level, and it's undesirable to have code that can only pretend to do that.
      Any users of these functions cannot know what the state of the system is in
      the error case. And the error paths are, to my knowledge, impossible to
      test without adding custom error injection code.
      
      Summary:
      
      * This code has not worked reliably since its introduction.
      * There is no evidence that it is used.
      * It contains questionable rollback behaviors in error paths which are
        difficult to test.
      
      So let's remove it.
      
      Fixes: ac713800 ("powerpc/pseries: Add CPU dlpar remove functionality")
      Fixes: 90edf184 ("powerpc/pseries: Add CPU dlpar add functionality")
      Fixes: b015f6bc ("powerpc/pseries: Add cpu DLPAR support for drc-info property")
      Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
      Tested-by: Daniel Henrique Barboza <danielhb413@gmail.com>
      Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210927201933.76786-4-nathanl@linux.ibm.com
      fa2a5dfe
    • Nathan Lynch's avatar
      powerpc/cpuhp: BUG -> WARN conversion in offline path · 983f9101
      Nathan Lynch authored
      
      
      If, due to bugs elsewhere, we get into unregister_cpu_online() with a CPU
      that isn't marked hotpluggable, we can emit a warning and return an
      appropriate error instead of crashing.
      
      Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
      Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210927201933.76786-3-nathanl@linux.ibm.com
      983f9101
    • Nathan Lynch's avatar
      powerpc/pseries/cpuhp: cache node corrections · 7edd5c9a
      Nathan Lynch authored
      On pseries, cache nodes in the device tree can be added and removed by the
      CPU DLPAR code as well as the partition migration (mobility) code. PowerVM
      partitions in dedicated processor mode typically have L2 and L3 cache
      nodes.
      
      The CPU DLPAR code has the following shortcomings:
      
      * Cache nodes returned as siblings of a new CPU node by
        ibm,configure-connector are silently discarded; only the CPU node is
        added to the device tree.
      
      * Cache nodes which become unreferenced in the processor removal path are
        not removed from the device tree. This can lead to duplicate nodes when
        the post-migration device tree update code replaces cache nodes.
      
      This is long-standing behavior. Presumably it has gone mostly unnoticed
      because the two bugs have the property of obscuring each other in common
      simple scenarios (e.g. remove a CPU and add it back). Likely you'd notice
      only if you cared to inspect the device tree or the sysfs cacheinfo
      information.
      
      Booted with two processors:
      
        $ pwd
        /sys/firmware/devicetree/base/cpus
        $ ls -1d */
        l2-cache@2010/
        l2-cache@2011/
        l3-cache@3110/
        l3-cache@3111/
        PowerPC,POWER9@0/
        PowerPC,POWER9@8/
        $ lsprop */l2-cache
        l2-cache@2010/l2-cache
                       00003110 (12560)
        l2-cache@2011/l2-cache
                       00003111 (12561)
        PowerPC,POWER9@0/l2-cache
                       00002010 (8208)
        PowerPC,POWER9@8/l2-cache
                       00002011 (8209)
        $ ls /sys/devices/system/cpu/cpu0/cache/
        index0  index1  index2  index3
      
      After DLPAR-adding PowerPC,POWER9@10, we see that its associated cache
      nodes are absent, its threads' L2+L3 cacheinfo is unpopulated, and it is
      missing a cache level in its sched domain hierarchy:
      
        $ ls -1d */
        l2-cache@2010/
        l2-cache@2011/
        l3-cache@3110/
        l3-cache@3111/
        PowerPC,POWER9@0/
        PowerPC,POWER9@10/
        PowerPC,POWER9@8/
        $ lsprop PowerPC\,POWER9@10/l2-cache
        PowerPC,POWER9@10/l2-cache
                       00002012 (8210)
        $ ls /sys/devices/system/cpu/cpu16/cache/
        index0  index1
        $ grep . /sys/kernel/debug/sched/domains/cpu{0,8,16}/domain*/name
        /sys/kernel/debug/sched/domains/cpu0/domain0/name:SMT
        /sys/kernel/debug/sched/domains/cpu0/domain1/name:CACHE
        /sys/kernel/debug/sched/domains/cpu0/domain2/name:DIE
        /sys/kernel/debug/sched/domains/cpu8/domain0/name:SMT
        /sys/kernel/debug/sched/domains/cpu8/domain1/name:CACHE
        /sys/kernel/debug/sched/domains/cpu8/domain2/name:DIE
        /sys/kernel/debug/sched/domains/cpu16/domain0/name:SMT
        /sys/kernel/debug/sched/domains/cpu16/domain1/name:DIE
      
      When removing PowerPC,POWER9@8, we see that its cache nodes are left
      behind:
      
        $ ls -1d */
        l2-cache@2010/
        l2-cache@2011/
        l3-cache@3110/
        l3-cache@3111/
        PowerPC,POWER9@0/
      
      When DLPAR is combined with VM migration, we can get duplicate nodes. E.g.
      removing one processor, then migrating, adding a processor, and then
      migrating again can result in warnings from the OF core during
      post-migration device tree updates:
      
        Duplicate name in cpus, renamed to "l2-cache@2011#1"
        Duplicate name in cpus, renamed to "l3-cache@3111#1"
      
      and nodes with duplicated phandles in the tree, making lookup behavior
      unpredictable:
      
        $ lsprop l[23]-cache@*/ibm,phandle
        l2-cache@2010/ibm,phandle
                         00002010 (8208)
        l2-cache@2011#1/ibm,phandle
                         00002011 (8209)
        l2-cache@2011/ibm,phandle
                         00002011 (8209)
        l3-cache@3110/ibm,phandle
                         00003110 (12560)
        l3-cache@3111#1/ibm,phandle
                         00003111 (12561)
        l3-cache@3111/ibm,phandle
                         00003111 (12561)
      
      Address these issues by:
      
      * Correctly processing siblings of the node returned from
        dlpar_configure_connector().
      * Removing cache nodes in the CPU remove path when it can be determined
        that they are not associated with other CPUs or caches.
      
      Use the of_changeset API in both cases, which allows us to keep the error
      handling in this code from becoming more complex while ensuring that the
      device tree cannot become inconsistent.
      
      Fixes: ac713800 ("powerpc/pseries: Add CPU dlpar remove functionality")
      Fixes: 90edf184 ("powerpc/pseries: Add CPU dlpar add functionality")
      Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
      Tested-by: Daniel Henrique Barboza <danielhb413@gmail.com>
      Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210927201933.76786-2-nathanl@linux.ibm.com
      7edd5c9a
    • Nathan Lynch's avatar
      powerpc/paravirt: correct preempt debug splat in vcpu_is_preempted() · fda0eb22
      Nathan Lynch authored
      vcpu_is_preempted() can be used outside of preempt-disabled critical
      sections, yielding warnings such as:
      
      BUG: using smp_processor_id() in preemptible [00000000] code: systemd-udevd/185
      caller is rwsem_spin_on_owner+0x1cc/0x2d0
      CPU: 1 PID: 185 Comm: systemd-udevd Not tainted 5.15.0-rc2+ #33
      Call Trace:
      [c000000012907ac0] [c000000000aa30a8] dump_stack_lvl+0xac/0x108 (unreliable)
      [c000000012907b00] [c000000001371f70] check_preemption_disabled+0x150/0x160
      [c000000012907b90] [c0000000001e0e8c] rwsem_spin_on_owner+0x1cc/0x2d0
      [c000000012907be0] [c0000000001e1408] rwsem_down_write_slowpath+0x478/0x9a0
      [c000000012907ca0] [c000000000576cf4] filename_create+0x94/0x1e0
      [c000000012907d10] [c00000000057ac08] do_symlinkat+0x68/0x1a0
      [c000000012907d70] [c00000000057ae18] sys_symlink+0x58/0x70
      [c000000012907da0] [c00000000002e448] system_call_exception+0x198/0x3c0
      [c000000012907e10] [c00000000000c54c] system_call_common+0xec/0x250
      
      The result of vcpu_is_preempted() is always used speculatively, and the
      function does not access per-cpu resources in a (Linux) preempt-unsafe way.
      Use raw_smp_processor_id() to avoid such warnings, adding explanatory
      comments.
      
      Fixes: ca3f969d ("powerpc/paravirt: Use is_kvm_guest() in vcpu_is_preempted()")
      Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210928214147.312412-3-nathanl@linux.ibm.com
      fda0eb22
    • Nathan Lynch's avatar
      powerpc/paravirt: vcpu_is_preempted() commentary · 799f9b51
      Nathan Lynch authored
      
      
      Add comments more clearly documenting that this function determines whether
      hypervisor-level preemption of the VM has occurred.
      
      Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210928214147.312412-2-nathanl@linux.ibm.com
      799f9b51