Unverified commit b0092ae9, authored by openeuler-ci-bot, committed by Gitee

!11950 arm64/perf: Enable branch stack sampling

Merge Pull Request from: @hejunhao3 
 
```text
========== Perf Branch Stack Sampling Support (arm64 platforms) ===========

Currently the arm64 platform does not support perf branch stack sampling.
Hence any event requesting branch stack records, i.e. one that has
PERF_SAMPLE_BRANCH_STACK set in event->attr.sample_type, is rejected in
armpmu_event_init().

static int armpmu_event_init(struct perf_event *event)
{
	........
        /* does not support taken branch sampling */
        if (has_branch_stack(event))
                return -EOPNOTSUPP;
	........
}

$perf record -j any,u,k ls
Error:
cycles:P: PMU Hardware or event type doesn't support branch stack sampling.

-------------------- CONFIG_ARM64_BRBE and FEAT_BRBE ----------------------

After this series, perf branch stack sampling feature gets enabled on arm64
platforms where FEAT_BRBE HW feature is supported, and CONFIG_ARM64_BRBE is
also selected during build. Let's observe all possible scenarios here.

1. Feature not built (!CONFIG_ARM64_BRBE):

Falls back to the current behaviour, i.e. the event gets rejected.

2. Feature built but HW not supported (CONFIG_ARM64_BRBE && !FEAT_BRBE):

Falls back to the current behaviour, i.e. the event gets rejected.

3. Feature built and HW supported (CONFIG_ARM64_BRBE && FEAT_BRBE):

Platform supports branch stack sampling requests. Let's observe through a
simple example here.

$perf record -j any_call,u,k,save_type ls

[Please refer to the perf-record man page for all possible branch filter options]

$perf report
-------------------------- Snip ----------------------
# Overhead  Command  Source Shared Object  Source Symbol                                 Target Symbol                                 Basic Block Cycles
# ........  .......  ....................  ............................................  ............................................  ..................
#
     3.52%  ls       [kernel.kallsyms]     [k] sched_clock_noinstr                       [k] arch_counter_get_cntpct                   16
     3.52%  ls       [kernel.kallsyms]     [k] sched_clock                               [k] sched_clock_noinstr                       9
     1.85%  ls       [kernel.kallsyms]     [k] sched_clock_cpu                           [k] sched_clock                               5
     1.80%  ls       [kernel.kallsyms]     [k] irqtime_account_irq                       [k] sched_clock_cpu                           20
     1.58%  ls       [kernel.kallsyms]     [k] gic_handle_irq                            [k] generic_handle_domain_irq                 19
     1.58%  ls       [kernel.kallsyms]     [k] call_on_irq_stack                         [k] gic_handle_irq                            9
     1.58%  ls       [kernel.kallsyms]     [k] do_interrupt_handler                      [k] call_on_irq_stack                         23
     1.58%  ls       [kernel.kallsyms]     [k] generic_handle_domain_irq                 [k] __irq_resolve_mapping                     6
     1.58%  ls       [kernel.kallsyms]     [k] __irq_resolve_mapping                     [k] __rcu_read_lock                           10
-------------------------- Snip ----------------------

$perf report -D | grep cycles
-------------------------- Snip ----------------------
.....  1: ffff800080dd3334 -> ffff800080dd759c 39 cycles  P   0 IND_CALL
.....  2: ffff800080ffaea0 -> ffff800080ffb688 16 cycles  P   0 IND_CALL
.....  3: ffff800080139918 -> ffff800080ffae64 9  cycles  P   0 CALL
.....  4: ffff800080dd3324 -> ffff8000801398f8 7  cycles  P   0 CALL
.....  5: ffff8000800f8548 -> ffff800080dd330c 21 cycles  P   0 IND_CALL
.....  6: ffff8000800f864c -> ffff8000800f84ec 6  cycles  P   0 CALL
.....  7: ffff8000800f86dc -> ffff8000800f8638 11 cycles  P   0 CALL
.....  8: ffff8000800f86d4 -> ffff800081008630 16 cycles  P   0 CALL
-------------------------- Snip ----------------------

perf script and other tooling can also be applied to the captured perf.data.
Similarly, branch stack sampling records can be collected via a direct system
call, i.e. perf_event_open(), after setting 'struct perf_event_attr' as
required.

event->attr.sample_type |= PERF_SAMPLE_BRANCH_STACK
event->attr.branch_sample_type |= PERF_SAMPLE_BRANCH_<FILTER_1> |
				  PERF_SAMPLE_BRANCH_<FILTER_2> |
				  PERF_SAMPLE_BRANCH_<FILTER_3> |
				  ...............................

But all branch filters might not be supported on the platform.

----------------------- BRBE Branch Filters Support -----------------------

- The following branch filters are supported on arm64.

	PERF_SAMPLE_BRANCH_USER		/* Branch privilege filters */
	PERF_SAMPLE_BRANCH_KERNEL
	PERF_SAMPLE_BRANCH_HV

	PERF_SAMPLE_BRANCH_ANY		/* Branch type filters */
	PERF_SAMPLE_BRANCH_ANY_CALL
	PERF_SAMPLE_BRANCH_ANY_RETURN
	PERF_SAMPLE_BRANCH_IND_CALL
	PERF_SAMPLE_BRANCH_COND
	PERF_SAMPLE_BRANCH_IND_JUMP
	PERF_SAMPLE_BRANCH_CALL

	PERF_SAMPLE_BRANCH_NO_FLAGS	/* Branch record flags */
	PERF_SAMPLE_BRANCH_NO_CYCLES
	PERF_SAMPLE_BRANCH_TYPE_SAVE
	PERF_SAMPLE_BRANCH_HW_INDEX
	PERF_SAMPLE_BRANCH_PRIV_SAVE

- The following branch filters are not supported on arm64.

	PERF_SAMPLE_BRANCH_ABORT_TX
	PERF_SAMPLE_BRANCH_IN_TX
	PERF_SAMPLE_BRANCH_NO_TX
	PERF_SAMPLE_BRANCH_CALL_STACK

Events requesting any of the above unsupported branch filters get rejected.

--------------------------- Virtualisation support ------------------------

- No guest support

-------------------------------- Testing ---------------------------------

- Cross compiled for both arm64 and arm32 platforms
- Passes all branch tests with 'perf test branch' on arm64
``` 
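
The perf_event_open() route described above can be sketched in C. This is a minimal sketch, not code from the series: the helper names (`brbe_filters_ok`, `init_branch_attr`, `open_branch_event`) are hypothetical, the sample period is an arbitrary illustration value, and the filter mask mirrors `perf record -j any_call,u,k,save_type`. The set of rejected filters is copied from the "not supported on arm64" list in the description.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Filters the description lists as not supported on arm64. */
#define BRBE_UNSUPPORTED_FILTERS                                  \
	(PERF_SAMPLE_BRANCH_ABORT_TX | PERF_SAMPLE_BRANCH_IN_TX | \
	 PERF_SAMPLE_BRANCH_NO_TX | PERF_SAMPLE_BRANCH_CALL_STACK)

/* Hypothetical helper: nonzero when no unsupported filter is requested. */
static inline int brbe_filters_ok(uint64_t branch_sample_type)
{
	return (branch_sample_type & BRBE_UNSUPPORTED_FILTERS) == 0;
}

/* Hypothetical helper: describe a cycles event that carries branch records. */
static inline void init_branch_attr(struct perf_event_attr *attr,
				    uint64_t filters)
{
	assert(attr != NULL);
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_HARDWARE;
	attr->config = PERF_COUNT_HW_CPU_CYCLES;
	attr->sample_period = 100000;	/* arbitrary illustration value */
	attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
	attr->branch_sample_type = filters;
	attr->disabled = 1;		/* caller enables the event later */
}

/*
 * Hypothetical helper: open the event for the calling thread on any CPU.
 * Returns a file descriptor, or -1 with errno set (for example when the
 * PMU does not support branch stack sampling).
 */
static inline int open_branch_event(struct perf_event_attr *attr)
{
	return (int)syscall(SYS_perf_event_open, attr, 0, -1, -1, 0);
}
```

A caller would build the mask (for instance `PERF_SAMPLE_BRANCH_ANY_CALL | PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_KERNEL | PERF_SAMPLE_BRANCH_TYPE_SAVE`), check it with `brbe_filters_ok()`, fill the attribute with `init_branch_attr()`, and then treat a negative return from `open_branch_event()` as "not supported here" rather than as a fatal error.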
 
Link:https://gitee.com/openeuler/kernel/pulls/11950

 

Reviewed-by: Xu Kuohai <xukuohai@huawei.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Liu Chao <liuchao173@huawei.com>
Reviewed-by: Zhang Jianhua <chris.zjh@huawei.com>
Signed-off-by: Zhang Peng <zhangpeng362@huawei.com>
parents b95a35e8 84d0c33e
+21 −0
@@ -349,6 +349,27 @@ Before jumping into the kernel, the following conditions must be met:

    - HWFGWTR_EL2.nSMPRI_EL1 (bit 54) must be initialised to 0b01.

  For CPUs with feature Branch Record Buffer Extension (FEAT_BRBE):

  - If EL3 is present:

    - MDCR_EL3.SBRBE (bits 33:32) must be initialised to 0b11.

  - If the kernel is entered at EL1 and EL2 is present:

    - BRBCR_EL2.CC (bit 3) must be initialised to 0b1.
    - BRBCR_EL2.MPRED (bit 4) must be initialised to 0b1.

    - HDFGRTR_EL2.nBRBDATA (bit 61) must be initialised to 0b1.
    - HDFGRTR_EL2.nBRBCTL  (bit 60) must be initialised to 0b1.
    - HDFGRTR_EL2.nBRBIDR  (bit 59) must be initialised to 0b1.

    - HDFGWTR_EL2.nBRBDATA (bit 61) must be initialised to 0b1.
    - HDFGWTR_EL2.nBRBCTL  (bit 60) must be initialised to 0b1.

    - HFGITR_EL2.nBRBIALL (bit 56) must be initialised to 0b1.
    - HFGITR_EL2.nBRBINJ  (bit 55) must be initialised to 0b1.
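
The BRBE boot requirements listed above amount to a handful of bit masks. The following C sketch computes them from the bit positions given in the list; the macro and struct names are local to this example (they are not the kernel's sysreg definitions), and real firmware would OR these bits into whatever else each register already needs.

```c
#include <assert.h>
#include <stdint.h>

#define BIT64(n)	(UINT64_C(1) << (n))

/* Bit positions copied from the list above; names are local to this sketch. */
#define EX_MDCR_EL3_SBRBE_ALLOW		(UINT64_C(0x3) << 32)	/* bits 33:32 = 0b11 */
#define EX_BRBCR_EL2_CC			BIT64(3)
#define EX_BRBCR_EL2_MPRED		BIT64(4)
#define EX_HDFGxTR_EL2_nBRBDATA		BIT64(61)
#define EX_HDFGxTR_EL2_nBRBCTL		BIT64(60)
#define EX_HDFGRTR_EL2_nBRBIDR		BIT64(59)	/* read trap register only */
#define EX_HFGITR_EL2_nBRBIALL		BIT64(56)
#define EX_HFGITR_EL2_nBRBINJ		BIT64(55)

/* BRBE-related bits only; other fields of each register are left out. */
struct brbe_boot_bits {
	uint64_t mdcr_el3;
	uint64_t brbcr_el2;
	uint64_t hdfgrtr_el2;
	uint64_t hdfgwtr_el2;
	uint64_t hfgitr_el2;
};

static inline struct brbe_boot_bits brbe_boot_bits(void)
{
	struct brbe_boot_bits b = {
		.mdcr_el3    = EX_MDCR_EL3_SBRBE_ALLOW,
		.brbcr_el2   = EX_BRBCR_EL2_CC | EX_BRBCR_EL2_MPRED,
		.hdfgrtr_el2 = EX_HDFGxTR_EL2_nBRBDATA | EX_HDFGxTR_EL2_nBRBCTL |
			       EX_HDFGRTR_EL2_nBRBIDR,
		.hdfgwtr_el2 = EX_HDFGxTR_EL2_nBRBDATA | EX_HDFGxTR_EL2_nBRBCTL,
		.hfgitr_el2  = EX_HFGITR_EL2_nBRBIALL | EX_HFGITR_EL2_nBRBINJ,
	};

	/* Every write-trap bit listed also appears in the read-trap list. */
	assert((b.hdfgwtr_el2 & ~b.hdfgrtr_el2) == 0);
	return b;
}
```

Note that the read and write trap registers differ: the list gives nBRBIDR (bit 59) only for HDFGRTR_EL2, so the two masks must be built separately.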

  For CPUs with the Scalable Matrix Extension FA64 feature (FEAT_SME_FA64):

  - If EL3 is present:
+1 −0
@@ -6841,6 +6841,7 @@ CONFIG_ARM_PMU=y
CONFIG_ARM_PMU_ACPI=y
CONFIG_ARM_SMMU_V3_PMU=m
CONFIG_ARM_PMUV3=y
CONFIG_ARM64_BRBE=y
# CONFIG_ARM_DSU_PMU is not set
CONFIG_QCOM_L2_PMU=y
CONFIG_QCOM_L3_PMU=y
+84 −3
Original line number Diff line number Diff line
@@ -155,6 +155,40 @@
.Lskip_set_cptr_\@:
.endm

/*
 * Enable BRBE to record cycle counts and branch mispredicts.
 *
 * At any EL, to record cycle counts BRBE requires that both
 * BRBCR_EL2.CC=1 and BRBCR_EL1.CC=1.
 *
 * At any EL, to record branch mispredicts BRBE requires that both
 * BRBCR_EL2.MPRED=1 and BRBCR_EL1.MPRED=1.
 *
 * When HCR_EL2.E2H=1, the BRBCR_EL1 encoding is redirected to
 * BRBCR_EL2, but the {CC,MPRED} bits in the real BRBCR_EL1 register
 * still apply.
 *
 * Set {CC,MPRED} in both BRBCR_EL2 and BRBCR_EL1 so that at runtime we
 * only need to enable/disable these in BRBCR_EL1 regardless of whether
 * the kernel ends up executing in EL1 or EL2.
 */
.macro __init_el2_brbe
	mrs	x1, id_aa64dfr0_el1
	ubfx	x1, x1, #ID_AA64DFR0_EL1_BRBE_SHIFT, #4
	cbz	x1, .Lskip_brbe_\@

	mov_q	x0, BRBCR_ELx_CC | BRBCR_ELx_MPRED
	msr_s	SYS_BRBCR_EL2, x0

	__check_hvhe .Lset_brbe_nvhe_\@, x1
	msr_s	SYS_BRBCR_EL12, x0	// VHE
	b	.Lskip_brbe_\@

.Lset_brbe_nvhe_\@:
	msr_s	SYS_BRBCR_EL1, x0	// NVHE
.Lskip_brbe_\@:
.endm
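
The feature probe at the top of the macro above (`mrs` / `ubfx ... #4` / `cbz`) reads a 4-bit ID field and skips the BRBE setup when it is zero. A C rendering of that check might look as follows. This is an illustrative sketch: the shift value of 52 assumes the BRBE field sits at ID_AA64DFR0_EL1 bits [55:52], and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the BRBE field occupies ID_AA64DFR0_EL1 bits [55:52]. */
#define EX_ID_AA64DFR0_EL1_BRBE_SHIFT	52

/* C equivalent of "ubfx xd, xn, #shift, #width" (unsigned bitfield extract). */
static inline uint64_t ubfx64(uint64_t reg, unsigned int shift,
			      unsigned int width)
{
	assert(width > 0 && width < 64 && shift + width <= 64);
	return (reg >> shift) & ((UINT64_C(1) << width) - 1);
}

/* Mirrors the cbz test: a zero field means BRBE is not implemented. */
static inline int cpu_has_brbe(uint64_t id_aa64dfr0)
{
	return ubfx64(id_aa64dfr0, EX_ID_AA64DFR0_EL1_BRBE_SHIFT, 4) != 0;
}
```

The same extract-and-test pattern is repeated in the fine-grained trap setup, which is why the probe is written against the raw register value rather than a cached capability.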

/* Disable any fine grained traps */
.macro __init_el2_fgt
	mrs	x1, id_aa64mmfr0_el1
@@ -162,16 +196,48 @@
	cbz	x1, .Lskip_fgt_\@

	mov	x0, xzr
	mov	x2, xzr
	mrs	x1, id_aa64dfr0_el1
	ubfx	x1, x1, #ID_AA64DFR0_EL1_PMSVer_SHIFT, #4
	cmp	x1, #3
	b.lt	.Lset_debug_fgt_\@

	/* Disable PMSNEVFR_EL1 read and write traps */
	orr	x0, x0, #(1 << 62)
	orr	x0, x0, #HDFGRTR_EL2_nPMSNEVFR_EL1_MASK
	orr	x2, x2, #HDFGWTR_EL2_nPMSNEVFR_EL1_MASK

.Lset_debug_fgt_\@:
#ifdef CONFIG_ARM64_BRBE
	mrs	x1, id_aa64dfr0_el1
	ubfx	x1, x1, #ID_AA64DFR0_EL1_BRBE_SHIFT, #4
	cbz	x1, .Lskip_brbe_reg_fgt_\@

	/*
	 * Disable read traps for the following registers
	 *
	 * [BRBSRC|BRBTGT|BRBINF]_EL1
	 * [BRBSRCINJ|BRBTGTINJ|BRBINFINJ|BRBTS]_EL1
	 */
	orr	x0, x0, #HDFGRTR_EL2_nBRBDATA_MASK

	/*
	 * Disable write traps for the following registers
	 *
	 * [BRBSRCINJ|BRBTGTINJ|BRBINFINJ|BRBTS]_EL1
	 */
	orr	x2, x2, #HDFGWTR_EL2_nBRBDATA_MASK

	/* Disable read and write traps for [BRBCR|BRBFCR]_EL1 */
	orr	x0, x0, #HDFGRTR_EL2_nBRBCTL_MASK
	orr	x2, x2, #HDFGWTR_EL2_nBRBCTL_MASK

	/* Disable read traps for BRBIDR_EL1 */
	orr	x0, x0, #HDFGRTR_EL2_nBRBIDR_MASK

.Lskip_brbe_reg_fgt_\@:
#endif /* CONFIG_ARM64_BRBE */
	msr_s	SYS_HDFGRTR_EL2, x0
	msr_s	SYS_HDFGWTR_EL2, x0
	msr_s	SYS_HDFGWTR_EL2, x2

	mov	x0, xzr
	mrs	x1, id_aa64pfr1_el1
@@ -194,7 +260,21 @@
.Lset_fgt_\@:
	msr_s	SYS_HFGRTR_EL2, x0
	msr_s	SYS_HFGWTR_EL2, x0
	msr_s	SYS_HFGITR_EL2, xzr
	mov	x0, xzr
#ifdef CONFIG_ARM64_BRBE
	mrs	x1, id_aa64dfr0_el1
	ubfx	x1, x1, #ID_AA64DFR0_EL1_BRBE_SHIFT, #4
	cbz	x1, .Lskip_brbe_insn_fgt_\@

	/* Disable traps for BRBIALL instruction */
	orr	x0, x0, #HFGITR_EL2_nBRBIALL_MASK

	/* Disable traps for BRBINJ instruction */
	orr	x0, x0, #HFGITR_EL2_nBRBINJ_MASK

.Lskip_brbe_insn_fgt_\@:
#endif /* CONFIG_ARM64_BRBE */
	msr_s	SYS_HFGITR_EL2, x0

	mrs	x1, id_aa64pfr0_el1		// AMU traps UNDEF without AMU
	ubfx	x1, x1, #ID_AA64PFR0_EL1_AMU_SHIFT, #4
@@ -229,6 +309,7 @@
	__init_el2_nvhe_idregs
	__init_el2_cptr
	__init_el2_fgt
	__init_el2_brbe
.endm

#ifndef __KVM_NVHE_HYPERVISOR__
+2 −2
@@ -616,7 +616,7 @@ static __always_inline u64 kvm_get_reset_cptr_el2(struct kvm_vcpu *vcpu)
		val = (CPACR_EL1_FPEN_EL0EN | CPACR_EL1_FPEN_EL1EN);

		if (!vcpu_has_sve(vcpu) ||
		    (vcpu->arch.fp_state != FP_STATE_GUEST_OWNED))
		    (*host_data_ptr(fp_owner) != FP_STATE_GUEST_OWNED))
			val |= CPACR_EL1_ZEN_EL1EN | CPACR_EL1_ZEN_EL0EN;
		if (cpus_have_final_cap(ARM64_SME))
			val |= CPACR_EL1_SMEN_EL1EN | CPACR_EL1_SMEN_EL0EN;
@@ -624,7 +624,7 @@ static __always_inline u64 kvm_get_reset_cptr_el2(struct kvm_vcpu *vcpu)
		val = CPTR_NVHE_EL2_RES1;

		if (vcpu_has_sve(vcpu) &&
		    (vcpu->arch.fp_state == FP_STATE_GUEST_OWNED))
		    (*host_data_ptr(fp_owner) == FP_STATE_GUEST_OWNED))
			val |= CPTR_EL2_TZ;
		if (cpus_have_final_cap(ARM64_SME))
			val &= ~CPTR_EL2_TSM;
+67 −27
@@ -450,8 +450,43 @@ struct kvm_cpu_context {
	struct kvm_vcpu *__hyp_running_vcpu;
};

/*
 * This structure is instantiated on a per-CPU basis, and contains
 * data that is:
 *
 * - tied to a single physical CPU, and
 * - either have a lifetime that does not extend past vcpu_put()
 * - or is an invariant for the lifetime of the system
 *
 * Use host_data_ptr(field) as a way to access a pointer to such a
 * field.
 */
struct kvm_host_data {
	struct kvm_cpu_context host_ctxt;
	struct user_fpsimd_state *fpsimd_state;	/* hyp VA */

	/* Ownership of the FP regs */
	enum {
		FP_STATE_FREE,
		FP_STATE_HOST_OWNED,
		FP_STATE_GUEST_OWNED,
	} fp_owner;

	/*
	 * host_debug_state contains the host registers which are
	 * saved and restored during world switches.
	 */
	 struct {
		/* {Break,watch}point registers */
		struct kvm_guest_debug_arch regs;
		/* Statistical profiling extension */
		u64 pmscr_el1;
		/* Self-hosted trace */
		u64 trfcr_el1;
		/* Values of trap registers for the host before guest entry. */
		u64 mdcr_el2;
		u64 brbcr_el1;
	} host_debug_state;
};

struct kvm_host_psci_config {
@@ -510,19 +545,9 @@ struct kvm_vcpu_arch {
	u64 mdcr_el2;
	u64 cptr_el2;

	/* Values of trap registers for the host before guest entry. */
	u64 mdcr_el2_host;

	/* Exception Information */
	struct kvm_vcpu_fault_info fault;

	/* Ownership of the FP regs */
	enum {
		FP_STATE_FREE,
		FP_STATE_HOST_OWNED,
		FP_STATE_GUEST_OWNED,
	} fp_state;

	/* Configuration flags, set once and for all before the vcpu can run */
	u8 cflags;

@@ -545,11 +570,10 @@ struct kvm_vcpu_arch {
	 * We maintain more than a single set of debug registers to support
	 * debugging the guest from the host and to maintain separate host and
	 * guest state during world switches. vcpu_debug_state are the debug
	 * registers of the vcpu as the guest sees them.  host_debug_state are
	 * the host registers which are saved and restored during
	 * world switches. external_debug_state contains the debug
	 * values we want to debug the guest. This is set via the
	 * KVM_SET_GUEST_DEBUG ioctl.
	 * registers of the vcpu as the guest sees them.
	 *
	 * external_debug_state contains the debug values we want to debug the
	 * guest. This is set via the KVM_SET_GUEST_DEBUG ioctl.
	 *
	 * debug_ptr points to the set of debug registers that should be loaded
	 * onto the hardware when running the guest.
@@ -558,18 +582,8 @@ struct kvm_vcpu_arch {
	struct kvm_guest_debug_arch vcpu_debug_state;
	struct kvm_guest_debug_arch external_debug_state;

	struct user_fpsimd_state *host_fpsimd_state;	/* hyp VA */
	struct task_struct *parent_task;

	struct {
		/* {Break,watch}point registers */
		struct kvm_guest_debug_arch regs;
		/* Statistical profiling extension */
		u64 pmscr_el1;
		/* Self-hosted trace */
		u64 trfcr_el1;
	} host_debug_state;

	/* VGIC state */
	struct vgic_cpu vgic_cpu;
	struct arch_timer_cpu timer_cpu;
@@ -755,8 +769,8 @@ struct kvm_vcpu_arch {
#define DEBUG_STATE_SAVE_SPE	__vcpu_single_flag(iflags, BIT(5))
/* Save TRBE context if active  */
#define DEBUG_STATE_SAVE_TRBE	__vcpu_single_flag(iflags, BIT(6))
/* vcpu running in HYP context */
#define VCPU_HYP_CONTEXT	__vcpu_single_flag(iflags, BIT(7))
/* Save BRBE context if active  */
#define DEBUG_STATE_SAVE_BRBE	__vcpu_single_flag(iflags, BIT(7))

/* SVE enabled for host EL0 */
#define HOST_SVE_ENABLED	__vcpu_single_flag(sflags, BIT(0))
@@ -1129,6 +1143,32 @@ struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr);

DECLARE_KVM_HYP_PER_CPU(struct kvm_host_data, kvm_host_data);

/*
 * How we access per-CPU host data depends on where we access it from,
 * and the mode we're in:
 *
 * - VHE and nVHE hypervisor bits use their locally defined instance
 *
 * - the rest of the kernel use either the VHE or nVHE one, depending on
 *   the mode we're running in.
 *
 *   Unless we're in protected mode, fully deprivileged, and the nVHE
 *   per-CPU stuff is exclusively accessible to the protected EL2 code.
 *   In this case, the EL1 code uses the *VHE* data as its private state
 *   (which makes sense in a way as there shouldn't be any shared state
 *   between the host and the hypervisor).
 *
 * Yes, this is all totally trivial. Shoot me now.
 */
#if defined(__KVM_NVHE_HYPERVISOR__) || defined(__KVM_VHE_HYPERVISOR__)
#define host_data_ptr(f)	(&this_cpu_ptr(&kvm_host_data)->f)
#else
#define host_data_ptr(f)						\
	(static_branch_unlikely(&kvm_protected_mode_initialized) ?	\
	 &this_cpu_ptr(&kvm_host_data)->f :				\
	 &this_cpu_ptr_hyp_sym(kvm_host_data)->f)
#endif

static inline void kvm_init_host_cpu_context(struct kvm_cpu_context *cpu_ctxt)
{
	/* The host's MPIDR is immutable, so let's set it up at boot time */