Commit 87814681 authored by openeuler-ci-bot, committed by Gitee

!34 SPR: HBM EDAC and MCA recovery enhancement and bug fix

[Description]
MCA recovery, with assistance from the OS kernel, isolates uncorrected data errors and recovers from them. At the same time, the EDAC driver translates the memory error address into a detailed location such as socket/MC/channel/DIMM/bank/row/column, which greatly benefits development and operations.
Upstream has recently merged several MCA recovery enhancements and bug fixes, and EDAC has gained support for HBM (High Bandwidth Memory) on SPR (Sapphire Rapids), along with further enhancements and bug fixes.
Note: almost all of these backported patches are applied directly from the upstream commits.

Upstream commits:
ea6d0630 mm/hwpoison: do not lock page again when me_huge_page() successfully recovers
a3f5d80e mm,hwpoison: send SIGBUS with error virutal address
a6e3cf70 x86/mce: Change to not send SIGBUS error during copy from user
bc1bb416 generic_perform_write()/iomap_write_actor(): saner logics for short copy
69065847 x86/mce: Drop copyin special case for #MC
33761363 x86/mce: Reduce number of machine checks taken during recovery
046545a6 mm/hwpoison: fix error page recovered but reported "not recovered"
bc1c99a5 EDAC: Add DDR5 new memory type
479f58dd EDAC/i10nm: Add Intel Sapphire Rapids server support
cf4e6d52 EDAC/i10nm: Retrieve and print retry_rd_err_log registers
2f4348e5 EDAC/skx_common: Add new ADXL components for 2-level memory
4bd4d32e EDAC/i10nm: Add detection of memory levels for ICX/SPR servers
c9450883 EDAC/i10nm: Add support for high bandwidth memory
e1ca90b7 EDAC/mc: Add new HBM2 memory type
7cb58db6 EDAC/skx_common: Set the memory type correctly for HBM memory
c370baa3 EDAC/i10nm: Release mdev/mbase when failing to detect HBM

[Testing]
kernel options:
CONFIG_X86_MCE=y
CONFIG_MEMORY_FAILURE=y
CONFIG_X86_MCE_INTEL=m
CONFIG_ACPI_APEI_EINJ=m
CONFIG_EDAC=y
CONFIG_EDAC_I10NM=m   
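
Before running the tests, it can help to confirm that the options above are enabled in the kernel under test. The following is a minimal sketch, not part of the patch set: `check_ras_config` is a hypothetical helper name, and the config file path in the usage comment is an assumption about where the distribution installs it. The helper accepts either `=y` or `=m` for each option, since whether an option is built in or modular can differ between configs.

```shell
#!/bin/sh
# Sketch: report which of the listed RAS options are enabled in a kernel
# config file. check_ras_config is a hypothetical helper, not an existing tool.
check_ras_config() {
    cfg="$1"
    missing=0
    for opt in \
        CONFIG_X86_MCE \
        CONFIG_MEMORY_FAILURE \
        CONFIG_X86_MCE_INTEL \
        CONFIG_ACPI_APEI_EINJ \
        CONFIG_EDAC \
        CONFIG_EDAC_I10NM
    do
        # Treat both built-in (=y) and modular (=m) as enabled.
        if grep -Eq "^${opt}=[ym]\$" "$cfg"; then
            echo "ok      $opt"
        else
            echo "MISSING $opt"
            missing=1
        fi
    done
    # Non-zero exit status if any option is absent or disabled.
    return "$missing"
}

# Usage (path is an assumption; adjust for the system under test):
#   check_ras_config "/boot/config-$(uname -r)"
```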

1. ras-tools: https://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
./einj_mem_uc -f -m 128:512:0 copyin : the number of MCE events triggered decreases with the patches applied.
./einj_mem_uc single -f : run repeatedly; without the patches, "not recovered" is sometimes reported.
After injecting an error and getting MCA recovery, also run "dmesg" to check that the error address is decoded to a detailed location.
2. modprobe einj, then ./cmcistorm 1 : inject a CE at an HBM address and check that EDAC decodes the location for an address in the HBM range. The same test also applies to 2-level memory.
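
When repeating the injection runs, the dmesg check can be scripted. The following is a rough sketch under stated assumptions: `ras_log_lines` and `count_not_recovered` are hypothetical helper names, and the patterns used to pick out RAS-related lines are an assumption about common kernel log prefixes, which may differ between kernel versions. The "not recovered" match is the text quoted in step 1.

```shell
#!/bin/sh
# Sketch: filter kernel log text read from stdin for RAS-related lines.
# The patterns are assumed log prefixes, not an exhaustive or guaranteed list.
ras_log_lines() {
    grep -E 'EDAC|Memory failure|mce:|[Hh]ardware [Ee]rror'
}

# Count lines on stdin that contain the "not recovered" text from step 1.
# grep -c exits non-zero when the count is 0, so mask that status.
count_not_recovered() {
    grep -c 'not recovered' || true
}

# Usage, after an injection run:
#   dmesg | ras_log_lines
#   dmesg | count_not_recovered
```

A non-zero count after running `./einj_mem_uc single -f` repeatedly would correspond to the "not recovered" symptom described in step 1.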

Note: retry_rd_err_log for SPR HBM EDAC is not supported yet; it will be backported once the patches are ready upstream.

Link: https://gitee.com/openeuler/kernel/pulls/34

From: @youquan_song 
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
parents 93e5fa79 028e1264