cpuinspect: add CPU-inspect infrastructure

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7ZBQB

----------------------------------

This adds the CPU-inspect infrastructure. CPU-inspect is designed to
provide a framework for early detection of SDC by proactively executing
CPU inspection test cases.

Silent Data Corruption (SDC), sometimes referred to as Silent Data Error
(SDE), is an industry-wide issue impacting not only long-protected
memory, storage, and networking, but also computer CPUs. As with
software issues, hardware-induced SDC can contribute to data loss and
corruption. An SDC occurs when an impacted CPU inadvertently causes
errors in the data it processes. For example, an impacted CPU might
miscalculate data (e.g., 1+1=3). There may be no indication of these
computational errors unless the software systematically checks for
errors [1].

SDC issues have been around for many years, but as chips have become
more advanced and compact in size, the transistors and lines have become
so tiny that small electrical fluctuations can cause errors. Most of
these errors are caused by defects during manufacturing and are screened
out by the vendors; others are caught by hardware error detection or
correction. However, some errors go undetected by hardware; therefore
only detection software can protect against such errors [1].

[1] https://support.google.com/cloud/answer/10759085

To use CPU-inspect, you need to load at least one inspector (the driver
that actually executes the CPU inspection code).

Here is an example of using CPU-inspect:

  # Set the cpumask of CPU-inspect to 10-20
  echo 10-20 > /sys/devices/system/cpu/cpuinspect/cpumask

  # Set the maximum CPU utilization of the inspection threads to 50%
  echo 50 > /sys/devices/system/cpu/cpuinspect/cpu_utility

  # Start the CPU inspection task
  echo 1 > /sys/devices/system/cpu/cpuinspect/start_patrol

  # Check the result to see whether any faulty CPUs were found
  cat /sys/devices/system/cpu/cpuinspect/result

In addition to being readable, the 'result' file in cpuinspect can also
be polled. A poll() call on 'result' returns when a faulty CPU is found
or when the inspection task has completed.

Signed-off-by: Yu Liao <liaoyu15@huawei.com>
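
The poll() behaviour of the 'result' file described above can be
sketched from user space as follows. This is a minimal illustration, not
part of the patch: it assumes that 'result' follows the usual sysfs
notification convention (read once, poll for POLLPRI/POLLERR, then
rewind and re-read), and it makes no assumption about the format of the
file contents. The helper names read_result() and wait_for_result() are
hypothetical.

```python
import os
import select

# Path taken from the commit message above.
RESULT_PATH = "/sys/devices/system/cpu/cpuinspect/result"


def read_result(fd):
    """Rewind the attribute and return its current contents as text."""
    os.lseek(fd, 0, os.SEEK_SET)
    return os.read(fd, 4096).decode().strip()


def wait_for_result(path=RESULT_PATH, timeout_ms=None):
    """Wait until the file is signalled or the timeout expires.

    Returns (signalled, contents). timeout_ms=None blocks indefinitely.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # Assumption: as with other pollable sysfs attributes, an initial
        # read is needed before poll() will report a later notification.
        read_result(fd)

        poller = select.poll()
        poller.register(fd, select.POLLPRI | select.POLLERR)
        events = poller.poll(timeout_ms)

        # Re-read from offset 0 to pick up the updated value.
        return bool(events), read_result(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    signalled, contents = wait_for_result()
    print("signalled" if signalled else "timed out", contents)
```

A monitoring tool could start the patrol via start_patrol, call
wait_for_result(), and then interpret the returned text to decide
whether a faulty CPU was reported or the inspection simply finished.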