Skip to content
Commit cb51a371 authored by Robert Richter's avatar Robert Richter Committed by Borislav Petkov
Browse files

EDAC/ghes: Setup DIMM label from DMI and use it in error reports



The ghes driver reports errors with 'unknown label' even if the actual
DIMM label is known, e.g.:

 EDAC MC0: 1 CE Single-bit ECC on unknown label (node:0 card:0
   module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM location:N0 DIMM_A0
   page:0x966a9b3 offset:0x0 grain:1 syndrome:0x0 - APEI location:
   node:0 card:0 module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM
   location:N0 DIMM_A0 status(0x0000000000000400): Storage error in
   DRAM memory)

Fix this by using struct dimm_info's label string in error reports:

 EDAC MC0: 1 CE Single-bit ECC on N0 DIMM_A0 (node:0 card:0 module:0
   rank:1 bank:515 col:14 bit_pos:16 DIMM location:N0 DIMM_A0
   page:0x99223d8 offset:0x0 grain:1 syndrome:0x0 - APEI location:
   node:0 card:0 module:0 rank:1 bank:515 col:14 bit_pos:16 DIMM
   location:N0 DIMM_A0 status(0x0000000000000400): Storage error in
   DRAM memory)

The labels are initialized by reading the bank and device strings
from DMI. Now, the label information can also read from sysfs. E.g. a
ThunderX2 system will show the following:

  /sys/devices/system/edac/mc/mc0/dimm0/dimm_label:N0 DIMM_A0
  /sys/devices/system/edac/mc/mc0/dimm1/dimm_label:N0 DIMM_B0
  /sys/devices/system/edac/mc/mc0/dimm2/dimm_label:N0 DIMM_C0
  /sys/devices/system/edac/mc/mc0/dimm3/dimm_label:N0 DIMM_D0
  /sys/devices/system/edac/mc/mc0/dimm4/dimm_label:N0 DIMM_E0
  /sys/devices/system/edac/mc/mc0/dimm5/dimm_label:N0 DIMM_F0
  /sys/devices/system/edac/mc/mc0/dimm6/dimm_label:N0 DIMM_G0
  /sys/devices/system/edac/mc/mc0/dimm7/dimm_label:N0 DIMM_H0
  /sys/devices/system/edac/mc/mc0/dimm8/dimm_label:N1 DIMM_I0
  /sys/devices/system/edac/mc/mc0/dimm9/dimm_label:N1 DIMM_J0
  /sys/devices/system/edac/mc/mc0/dimm10/dimm_label:N1 DIMM_K0
  /sys/devices/system/edac/mc/mc0/dimm11/dimm_label:N1 DIMM_L0
  /sys/devices/system/edac/mc/mc0/dimm12/dimm_label:N1 DIMM_M0
  /sys/devices/system/edac/mc/mc0/dimm13/dimm_label:N1 DIMM_N0
  /sys/devices/system/edac/mc/mc0/dimm14/dimm_label:N1 DIMM_O0
  /sys/devices/system/edac/mc/mc0/dimm15/dimm_label:N1 DIMM_P0

Since dimm_labels can be rewritten, that label will be used in a later
error report:

  # echo foobar >/sys/devices/system/edac/mc/mc0/dimm0/dimm_label
  # # some error injection here
  # dmesg | grep foobar
  [ 751.383533] EDAC MC0: 1 CE Single-bit ECC on foobar (node:0 card:0
  module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM location:N0 DIMM_A0
  page:0x8c8dc74 offset:0x0 grain:1 syndrome:0x0 - APEI location:
  node:0 card:0 module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM
  location:N0 DIMM_A0 status(0x0000000000000400): Storage error in DRAM
  memory)

 [ bp: Remove curly brackets around a single if-statement in dimm_setup_label(). ]

Signed-off-by: default avatarRobert Richter <rrichter@marvell.com>
Signed-off-by: default avatarBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200528101307.23245-1-rrichter@marvell.com
parent b3a9e3b9
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment