!171 SPR: HBM retry_rd_err_log support
Merge Pull Request from: @youquan_song

[Description]
https://gitee.com/openeuler/intel-kernel/issues/I5V3SJ

An HBM memory channel is divided into two pseudo channels. Each pseudo channel has its own retry_rd_err_log registers. Retrieve and print retry_rd_err_log registers of the HBM pseudo channel if the memory error is from HBM.

14646de4 EDAC/skx_common: Add ChipSelect ADXL component
acd4cf68 EDAC/i10nm: Retrieve and print retry_rd_err_log registers for HBM
d5f5e499 EDAC/i10nm: Print an extra register set of retry_rd_err_log

[Testing]
1. Add kernel options in grub: efi=nosoftreserve i10nm_edac.retry_rd_err_log=1
2. numactl -H
   node distances:
   node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
     0:  10  12  12  12  21  21  21  21  13  14  14  14  23  23  23  23
     1:  12  10  12  12  21  21  21  21  14  13  14  14  23  23  23  23
     2:  12  12  10  12  21  21  21  21  14  14  13  14  23  23  23  23
     3:  12  12  12  10  21  21  21  21  14  14  14  13  23  23  23  23
     4:  21  21  21  21  10  12  12  12  23  23  23  23  13  14  14  14
     5:  21  21  21  21  12  10  12  12  23  23  23  23  14  13  14  14
     6:  21  21  21  21  12  12  10  12  23  23  23  23  14  14  13  14
     7:  21  21  21  21  12  12  12  10  23  23  23  23  14  14  14  13
     8:  13  14  14  14  23  23  23  23  10  14  14  14  23  23  23  23
     9:  14  13  14  14  23  23  23  23  14  10  14  14  23  23  23  23
    10:  14  14  13  14  23  23  23  23  14  14  10  14  23  23  23  23
    11:  14  14  14  13  23  23  23  23  14  14  14  10  23  23  23  23
    12:  23  23  23  23  13  14  14  14  23  23  23  23  10  14  14  14
    13:  23  23  23  23  14  13  14  14  23  23  23  23  14  10  14  14
    14:  23  23  23  23  14  14  13  14  23  23  23  23  14  14  10  14
    15:  23  23  23  23  14  14  14  13  23  23  23  23  14  14  14  10
3. # modprobe einj
4. git clone https://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git, build it
5. # numactl --cpunodebind=0 --membind=13 /home/ras-tools/cmcistorm 1
   0: vaddr = 0x130d490 paddr = d87ff42490
6. # dmesg: output retry_rd_err_log registers value.
[83086.997090] EDAC MC29: 0 CE memory read error on CPU_SrcID#1_HBMC#9_Chan#1 (channel:1 page:0xd87ff42 offset:0x480 grain:32 syndrome:0x0 - err_code:0x0000:0x009f SystemAddress:0xd87ff42480 ProcessorSocketId:0x1 MemoryControllerId:0x9 ChannelAddress:0x7ffe8480 ChannelId:0x1 RankAddress:0x3fff4240 PhysicalRankId:0x0 Row:0x7ffe Column:0x14 Bank:0x0 BankGroup:0x0 ChipSelect:0x2 ChipId:0x1 retry_rd_err_log[08928208 00000000 0000000000010000 00500081 80007ffe 000000d87ff42480] correrrcnt[0001 0000 0001 0000 0000 0000 0000 0000])

Link:https://gitee.com/openeuler/kernel/pulls/171

Reviewed-by: Chen Wei <chenwei@xfusion.com>
Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Zheng Zengkai <zhengzengkai@huawei.com>
Reviewed-by: Jun Tian <jun.j.tian@intel.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
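
When checking the test result, it can help to pull the register words out of the EDAC line instead of reading them by eye. The following is a minimal sketch, not part of this patch set: it assumes the line layout shown in the dmesg sample from the testing steps (a `CPU_SrcID#..._HBMC#..._Chan#...` label and space-separated hex words inside `retry_rd_err_log[...]` and `correrrcnt[...]`). The names `parse_edac_line`, `HBM_LABEL_RE`, and `BRACKET_RE` are hypothetical helpers introduced only for illustration, and the sketch does not decode the meaning of individual register bits.

```python
import re

# Sample line copied from the dmesg output in the testing steps.
SAMPLE = (
    "[83086.997090] EDAC MC29: 0 CE memory read error on "
    "CPU_SrcID#1_HBMC#9_Chan#1 (channel:1 page:0xd87ff42 offset:0x480 "
    "grain:32 syndrome:0x0 - err_code:0x0000:0x009f "
    "SystemAddress:0xd87ff42480 ProcessorSocketId:0x1 "
    "MemoryControllerId:0x9 ChannelAddress:0x7ffe8480 ChannelId:0x1 "
    "RankAddress:0x3fff4240 PhysicalRankId:0x0 Row:0x7ffe Column:0x14 "
    "Bank:0x0 BankGroup:0x0 ChipSelect:0x2 ChipId:0x1 "
    "retry_rd_err_log[08928208 00000000 0000000000010000 00500081 "
    "80007ffe 000000d87ff42480] "
    "correrrcnt[0001 0000 0001 0000 0000 0000 0000 0000])"
)

# Assumption: an HBM error is identified by an "HBMC#" label, as in the sample.
HBM_LABEL_RE = re.compile(r"CPU_SrcID#(\d+)_HBMC#(\d+)_Chan#(\d+)")
# Assumption: both register dumps are hex words separated by single spaces.
BRACKET_RE = re.compile(r"(retry_rd_err_log|correrrcnt)\[([0-9a-fA-F ]+)\]")


def parse_edac_line(line):
    """Return the HBM label fields and register words found in one EDAC line."""
    result = {"hbm": None}
    label = HBM_LABEL_RE.search(line)
    if label:
        socket, controller, channel = (int(g) for g in label.groups())
        result["hbm"] = {
            "socket": socket,
            "controller": controller,
            "channel": channel,
        }
    # Convert each bracketed dump into a list of integers.
    for name, body in BRACKET_RE.findall(line):
        result[name] = [int(word, 16) for word in body.split()]
    return result


if __name__ == "__main__":
    parsed = parse_edac_line(SAMPLE)
    print("HBM label:", parsed["hbm"])
    print("retry_rd_err_log:", [hex(w) for w in parsed["retry_rd_err_log"]])
    print("correrrcnt:", parsed["correrrcnt"])
```

For a live check, the output of `dmesg` could be fed to `parse_edac_line` one line at a time; lines without an `HBMC#` label simply yield `"hbm": None`.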