From d22d73fc172327672c6bafaefa50fed1ce020d4a Mon Sep 17 00:00:00 2001 From: will Date: Tue, 26 Feb 2019 11:01:00 +0800 Subject: designs: Add ECC logging for BMC Create service to monitor EDAC driver and add log on BMC. Change-Id: Ifa3e3d103384eaaaaf9fa18a28e25fee8ec20bea Signed-off-by: will --- designs/ecc_dbus_sel.md | 126 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 126 insertions(+) create mode 100644 designs/ecc_dbus_sel.md diff --git a/designs/ecc_dbus_sel.md b/designs/ecc_dbus_sel.md new file mode 100644 index 0000000..b7ce07b --- /dev/null +++ b/designs/ecc_dbus_sel.md @@ -0,0 +1,126 @@ +### ECC Error SEL for BMC + +Author: Will Liang + +Primary assignee: Will Liang + +Created: 2019-02-26 + +#### Problem Description + +The IPMI SELs only define memory Error Correction Code (ECC) errors for host +memory rather than BMC. + +The aim of this proposal is to record ECC events from the BMC in the IPMI System +Event Log (SEL). Whenever ECC occurs, the BMC generates an event with the +appropriate information and adds it to the SEL. + +#### Background and References + +The IPMI specification defines memory system event log about ECC/other +correctable or ECC/other uncorrectable and whether ECC/other correctable memory +error logging limits are reached.[1]. The BMC ECC SEL will follow IPMI SEL +format and creates BMC memory ECC event log. + +OpenBMC currently support for generating SEL entries based on parsing the D-Bus +event log. It does not yet support the BMC ECC SEL feature in OpenBMC project. +Therefore, the memory ECC information will be registered to D-Bus and generate +memory ECC SEL as well. + +[[1]Intelligent Platform Management Interface Specification v2.0 rev 1.1, section 41](https://www.intel.com/content/www/us/en/servers/ipmi/ipmi-second-gen-interface-spec-v2-rev1-1.html) + +#### Requirements + +Currently, the OpenBMC project does not support ECC event logs in D-Bus because +there is no relevant ECC information in the OpenBMC D-Bus architecture. +The new ECC D-Bus information will be added to the OpenBMC project and an ECC +monitor service will be created to fetch the ECC count (ce_count/ue_count) from +the EDAC driver. And make sure the EDAC driver must be loaded and ECC/other +correctable or ECC/other uncorrectable counts need to be obtained from the EDAC +driver. + +#### Proposed Design + +ECC-enabled memory controllers can detect and correct errors in operating +systems (such as certain versions of Linux, macOS, and Windows) that allow +detection and correction of memory errors, which helps identify problems before +they become catastrophic faulty memory module. + +Many ECC memory systems use an "external" EDAC between the CPU and the memory +to fix memory error. Most host integrate EDAC into the CPU's integrated memory +controller. + +According to Section 42.2 of the IPMI specification, Table 42 [2], these SEL +sensor types will be defined as `Memory` and `Event Data 3` field can be used to +provide an event extension. Therefore, the BMC ECC event sets "Event Data 3" +with the value FEh to identify the BMC ECC error. + +[[2] Intelligent Platform Management Interface Specification v2.0 rev 1.1, section 42.2](https://www.intel.com/content/www/us/en/servers/ipmi/ipmi-second-gen-interface-spec-v2-rev1-1.html) + +The main purpose of this function is to provide the BMC with the ability to +record ECC error SELs. + +There are two new applications for this design: + +- poll the ECC error count +- create the ECC SEL + +It also devised a mechanism to limit the "maximum number" of logs to avoid +creating a large number of correctable ECC logs. When the `maximum quantity` is +reached, the ECC service will stop to record the ECC log. The `maximum quantity` +(default:100) is saved in the configuration file, and the user can modify the +value if necessary. + +##### phosphor-ecc.service + +This will always run the application and look up the ECC error count every +second after service is started. On first start, it resets all correctable ECC +counts and uncorrectable ECC counts in the EDAC driver. + +It also provide the following path on D-Bus: + +- bus name : `xyz.openbmc_project.Memory.ECC` +- object path : `/xyz/openbmc_project/metrics/memory/BmcECC` +- interface : `xyz.openbmc_project.Memory.MemoryECC` + +The interface with the following properties: +| Property | Type | Description | +| -------- | ---- | ----------- | +| isLoggingLimitReached | bool | ECC logging reach limits| +| ceCount| int64 | correctable ECC events | +| ueCount| int64 | uncorrectable ECC events | +| state| string | bmc ECC event state | + +The error types for `xyz::openbmc_project::Memory::Ecc::Error::ceCount` and +`ueCount` and `isLoggingLimitReached` will be created which generated the error +type for the ECC logs. + +##### Create the ECC SEL + +Use the `phosphor-sel-logger` package to record the following logs in BMC SEL +format. + +- correctable ECC log : when fetching the `ce_count` from EDAC driver parameter + and the count exceeds previous count. +- uncorrectable ECC log : when fetching the `ue_count` from EDAC driver parameter + and the count exceeds previous count. +- logging limit reached log : When the correctable ECC log reaches the + `maximum quantity`. + +#### Alternatives Considered + +Another consideration is that there is no stopping the recording of the ECC +logging mechanism. +When the checks `ce_count` and value exceeds the previous value, it will record +the ECC log. But this will encounter a lot of ECC logs, and BMC memory will +also be occupied. + +#### Impacts + +This application implementation only needs to make some changes when +creating the event log, so it has minimal impact on the rest of the system. + +#### Testing + +Depending on the platform hardware design, this test requires an ECC +driver to make fake ECC errors and then check the scenario is good. \ No newline at end of file -- cgit v1.2.1