Merge Pull Request from: @PrithivishS

*Description:*
--------------
Patches to introduce the AMD Address Translation Library (ATL) on Turin systems.

RAS: Introduce AMD Address Translation Library
RAS/AMD/ATL: Add MI300 support
RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support
RAS/AMD/ATL: Add MI300 row retirement support
RAS: Export helper to get ras_debugfs_dir
EDAC/amd64: Use new AMD Address Translation Library
RAS/AMD/ATL: Add amd_atl pr_fmt() prefix
RAS/AMD/ATL: Read DRAM hole base early
RAS/AMD/ATL: Expand helpers for adding and removing base and hole
RAS/AMD/ATL: Validate address map when information is gathered
RAS/AMD/ATL: Implement DF 4.5 NP2 denormalization
RAS: Introduce a FRU memory poison manager
RAS/AMD/FMPM: Save SPA values
RAS/AMD/FMPM: Add debugfs interface to print record entries
RAS/AMD/FMPM: Fix off by one when unwinding on error
RAS/AMD/FMPM: Avoid NULL ptr deref in get_saved_records()
RAS/AMD/FMPM: Safely handle saved records of various sizes
RAS/AMD/FMPM: Use atl internal.h for INVALID_SPA

The following logs were captured with RAS error injection tests. Without the ATL patches, EDAC reports "Cannot decode normalized address" and umc_normaddr_to_sysaddr rejects the DramBaseAddress range; with the patches, neither message appears.

**Without ATL patches on Turin systems**

```
[ 333.395989] mce: [Hardware Error]: Machine check events logged
[ 333.396016] [Hardware Error]: Corrected error, no action required.
[ 333.396028] [Hardware Error]: CPU:0 (1a:1:0) MC21_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000400011b
[ 333.396050] [Hardware Error]: Error Addr: 0x0000000000000000
[ 333.396058] [Hardware Error]: PPIN: 0x008319f6b6f24004
[ 333.396066] [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x7c7600010a800100
[ 333.396078] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[ 333.396096] umc_normaddr_to_sysaddr: Invalid DramBaseAddress range: 0x0.
[ 333.396119] EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#0channel#0 (csrow:0 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x1)
[ 333.396139] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
```

**With ATL patches on Turin systems**

```
[ 333.098899] [Hardware Error]: Corrected error, no action required.
[ 333.098910] [Hardware Error]: CPU:0 (1a:1:0) MC21_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000400011b
[ 333.098929] [Hardware Error]: Error Addr: 0x0000000000000000
[ 333.098936] [Hardware Error]: PPIN: 0x008319f6b6f24004
[ 333.098943] [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0x7c7600010a800100
[ 333.098954] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[ 333.098992] EDAC MC0: 1 CE on mc#0csrow#0channel#0 (csrow:0 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x1)
[ 333.099005] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
```

The Address Translation Library (ATL) backport has also been system tested.

Link: https://gitee.com/openeuler/kernel/pulls/12300

Reviewed-by: XiaoFei Tan <tanxiaofei@huawei.com>
Signed-off-by: Zhang Peng <zhangpeng362@huawei.com>