|author||will <Will.Liang@quantatw.com>||2019-02-26 11:01:00 +0800|
|committer||Gunnar Mills <firstname.lastname@example.org>||2019-04-18 20:11:50 +0000|
Create service to monitor EDAC driver and add log on BMC.

Change-Id: Ifa3e3d103384eaaaaf9fa18a28e25fee8ec20bea
Signed-off-by: will <Will.Liang@quantatw.com>
1 files changed, 126 insertions, 0 deletions
diff --git a/designs/ecc_dbus_sel.md b/designs/ecc_dbus_sel.md
new file mode 100644
@@ -0,0 +1,126 @@
+### ECC Error SEL for BMC
+Author: Will Liang
+Primary assignee: Will Liang
+#### Problem Description
+The IPMI SEL only defines memory Error Correction Code (ECC) errors for host
+memory, not for BMC memory.
+The aim of this proposal is to record ECC events from the BMC in the IPMI System
+Event Log (SEL). Whenever an ECC error occurs, the BMC generates an event with
+the appropriate information and adds it to the SEL.
+#### Background and References
+The IPMI specification defines memory system event log entries for ECC/other
+correctable errors, for ECC/other uncorrectable errors, and for the case where
+the ECC/other correctable memory error logging limit is reached. The BMC ECC
+SEL will follow the IPMI SEL format and create BMC memory ECC event log entries.
+OpenBMC currently supports generating SEL entries by parsing the D-Bus event
+log, but it does not yet support a BMC ECC SEL feature.
+Therefore, the memory ECC information will be registered on D-Bus and used to
+generate memory ECC SEL entries as well.
+[Intelligent Platform Management Interface Specification v2.0 rev 1.1, section 41](https://www.intel.com/content/www/us/en/servers/ipmi/ipmi-second-gen-interface-spec-v2-rev1-1.html)
+Currently, the OpenBMC project does not support ECC event logs in D-Bus because
+there is no relevant ECC information in the OpenBMC D-Bus architecture.
+New ECC D-Bus information will be added to the OpenBMC project, and an ECC
+monitor service will be created to fetch the ECC counts (ce_count/ue_count) from
+the EDAC driver. The EDAC driver must be loaded, because the ECC/other
+correctable and ECC/other uncorrectable counts are obtained from the EDAC
+driver.
+#### Proposed Design
+ECC-enabled memory controllers can detect and correct memory errors when the
+operating system (such as certain versions of Linux, macOS, and Windows)
+supports detection and correction of memory errors, which helps identify
+problems, such as a faulty memory module, before they become catastrophic.
+Many ECC memory systems use an "external" EDAC between the CPU and the memory
+to fix memory errors. Most hosts integrate EDAC into the CPU's integrated memory
+controller.
+According to Table 42 in Section 42.2 of the IPMI specification, the sensor
+type of these SEL entries will be `Memory`, and the `Event Data 3` field can be
+used to provide an event extension. Therefore, the BMC ECC event sets
+`Event Data 3` to the value FEh to identify a BMC ECC error.
+[Intelligent Platform Management Interface Specification v2.0 rev 1.1, section 42.2](https://www.intel.com/content/www/us/en/servers/ipmi/ipmi-second-gen-interface-spec-v2-rev1-1.html)
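To make the `Event Data 3` marker concrete, here is a minimal sketch of how the three event data bytes might be assembled. The FEh marker comes from this design; the sensor-specific offsets are the ones the IPMI specification lists for the `Memory` sensor type. The `Event Data 1` flag bits, the FFh filler for `Event Data 2`, and the function name `build_event_data` are assumptions made for illustration, not something this design specifies.

```python
# IPMI "Memory" sensor type and its sensor-specific offsets.
MEMORY_SENSOR_TYPE = 0x0C
OFFSET_CORRECTABLE_ECC = 0x00
OFFSET_UNCORRECTABLE_ECC = 0x01
OFFSET_LOGGING_LIMIT_REACHED = 0x05

# Event Data 3 value this design uses to mark a BMC (not host) ECC event.
BMC_ECC_MARKER = 0xFE


def build_event_data(offset):
    """Return the three SEL event data bytes for a BMC ECC event.

    Assumptions (not stated in the design):
      - bits [5:4] of Event Data 1 are set to 10b to flag an OEM code
        in Event Data 3,
      - Event Data 2 is left unspecified (FFh).
    """
    event_data1 = 0x20 | (offset & 0x0F)
    event_data2 = 0xFF
    return bytes([event_data1, event_data2, BMC_ECC_MARKER])
```

For example, `build_event_data(OFFSET_CORRECTABLE_ECC)` yields the bytes `20 FF FE` under these assumptions; a consumer can then tell a BMC ECC event from a host one by checking the last byte.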
+The main purpose of this function is to provide the BMC with the ability to
+record ECC error SELs.
+There are two new applications for this design:
+- poll the ECC error count
+- create the ECC SEL
+The design also includes a mechanism to limit the maximum number of logs, to
+avoid creating a large number of correctable ECC logs. When the
+`maximum quantity` is reached, the ECC service stops recording ECC logs. The
+`maximum quantity` (default: 100) is saved in the configuration file, and the
+user can modify the value if necessary.
+Once the service is started, the application runs continuously and reads the
+ECC error counts every second. On first start, it resets all correctable and
+uncorrectable ECC counts in the EDAC driver.
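The polling step described above can be sketched as follows. This is an illustration only: the real service is not shown in this document, the function names `read_counts` and `poll_once` are hypothetical, and the directory passed in (on many Linux systems something like `/sys/devices/system/edac/mc/mc0`) is an assumption about where the EDAC driver exposes `ce_count` and `ue_count`.

```python
from pathlib import Path


def read_counts(mc_dir):
    """Read (ce_count, ue_count) from an EDAC memory-controller directory."""
    mc = Path(mc_dir)
    ce = int((mc / "ce_count").read_text().strip())
    ue = int((mc / "ue_count").read_text().strip())
    return ce, ue


def poll_once(mc_dir, previous):
    """Compare the current counts with the previous (ce, ue) pair.

    Returns the current pair plus two flags saying whether the correctable
    or uncorrectable count has grown since the previous poll.
    """
    current = read_counts(mc_dir)
    ce_increased = current[0] > previous[0]
    ue_increased = current[1] > previous[1]
    return current, ce_increased, ue_increased
```

A service loop would call `poll_once` once per second (for example with `time.sleep(1)` between calls), carrying the returned pair forward as `previous` for the next iteration.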
+It also provides the following path on D-Bus:
+- bus name : `xyz.openbmc_project.Memory.ECC`
+- object path : `/xyz/openbmc_project/metrics/memory/BmcECC`
+- interface : `xyz.openbmc_project.Memory.MemoryECC`
+The interface has the following properties:
+| Property | Type | Description |
+| -------- | ---- | ----------- |
+| isLoggingLimitReached | bool | ECC logging limit reached |
+| ceCount | int64 | Number of correctable ECC events |
+| ueCount | int64 | Number of uncorrectable ECC events |
+| state | string | BMC ECC event state |
+Error types for `xyz::openbmc_project::Memory::Ecc::Error::ceCount`,
+`ueCount`, and `isLoggingLimitReached` will be created and used as the error
+types for the ECC logs.
+##### Create the ECC SEL
+Use the `phosphor-sel-logger` package to record the following logs in the BMC
+SEL:
+- correctable ECC log : when the `ce_count` fetched from the EDAC driver
+ exceeds the previous count.
+- uncorrectable ECC log : when the `ue_count` fetched from the EDAC driver
+ exceeds the previous count.
+- logging limit reached log : when the number of correctable ECC logs reaches
+ the `maximum quantity`.
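The three conditions above can be expressed as a small decision function. This is a sketch under stated assumptions, not the service's actual code: the name `select_logs` and the returned log labels are hypothetical, and it assumes that only correctable logs are capped by the `maximum quantity` while uncorrectable logs are always recorded.

```python
def select_logs(prev_ce, prev_ue, ce, ue, logged_ce, max_logs=100):
    """Decide which ECC logs to record for one polling step.

    prev_ce/prev_ue: counts seen on the previous poll
    ce/ue:           counts seen on this poll
    logged_ce:       correctable logs recorded so far
    max_logs:        the `maximum quantity` (default 100, as in the design)

    Returns (list of log labels, updated logged_ce).
    """
    logs = []
    # Assumption: uncorrectable errors are never suppressed by the limit.
    if ue > prev_ue:
        logs.append("uncorrectable")
    # Correctable errors are recorded only while below the limit.
    if ce > prev_ce and logged_ce < max_logs:
        logs.append("correctable")
        logged_ce += 1
        if logged_ce == max_logs:
            logs.append("logging_limit_reached")
    return logs, logged_ce
```

With `max_logs=2`, the second correctable log also produces the limit-reached log, and any later increase in `ce_count` produces no further correctable logs.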
+#### Alternatives Considered
+Another option considered is to never stop recording ECC logs: whenever
+`ce_count` is checked and its value exceeds the previous value, an ECC log is
+recorded. However, this would produce a large number of ECC logs and would
+also consume BMC memory.
+This application implementation only needs to make some changes when
+creating the event log, so it has minimal impact on the rest of the system.
+Depending on the platform hardware design, this test requires an ECC
+driver that can inject fake ECC errors, and then checks that the scenario
+works as expected.
\ No newline at end of file