Diffstat (limited to 'src/usr/isteps/nvdimm/ReadMe.md')
-rw-r--r-- src/usr/isteps/nvdimm/ReadMe.md | 278
1 files changed, 278 insertions, 0 deletions
diff --git a/src/usr/isteps/nvdimm/ReadMe.md b/src/usr/isteps/nvdimm/ReadMe.md
new file mode 100644
index 000000000..1f98438b2
--- /dev/null
+++ b/src/usr/isteps/nvdimm/ReadMe.md
@@ -0,0 +1,278 @@
+# Battery Power Module (BPM) Updates Overview
+To support different firmware versions released by SMART, the bpm_update.C and
+bpm_update.H files were created to facilitate upgrades and downgrades of the
+firmware version on a BPM attached to an NVDIMM. There are two kinds of BPM, one
+that supports 16GB type NVDIMMs and one that supports 32GB type NVDIMMs.
+Although they have separate image files, the update is functionally the same for
+each. This overview will not go into fine-grained detail on every process of
+the update. For more information, see the comments in bpm_update.H,
+bpm_update.C, and the various supporting files.
+
+Supporting Files:
+* Two image files, e.g., SRCA8062IBMH012B_FULL_FW_Rev1.03_02282019.txt or
+SRCA8062IBMH011B_FULL_FW_Rev1.04_05172019.txt
+ * The image file names are important because they contain information that
+ is not found anywhere else in the files. For example, the digit after
+ SRCA8062IBMH01 and right before the B signifies which kind of BPM the image
+ is for: 1 means the 32GB type, 2 means the 16GB type. The version (Rev1.0x)
+ also appears only in the file name and nowhere inside the image file. So, to
+ differentiate the updates from each other, the file names must be left
+ intact.
+* src/build/buildpnor/buildBpmFlashImages.pl
+ * This perl script is responsible for packaging the image files listed above
+ into binaries that can then be associated with LIDs for use during the BPM
+ update.
+* src/build/buildpnor/bpm-utils/imageCrc.c and
+src/build/buildpnor/bpm-utils/insertBpmFwCrc.py
+ * These are provided by SMART and utilized by buildBpmFlashImages.pl to
+ generate the correct CRC for the firmware image during the fsp build.
+* src/build/mkrules/dist.targets.mk
+ * This file puts src/build/buildpnor/buildBpmFlashImages.pl,
+ src/build/buildpnor/bpm-utils/imageCrc.c,
+ and src/build/buildpnor/bpm-utils/insertBpmFwCrc.py into the fsp.tar which
+ can then be primed over to an FSP sandbox.
+* <fsp_sandbox>/src/engd/nvdimm/makefile
+ * This makefile compiles the src/build/buildpnor/bpm-utils/imageCrc.c and
+ calls src/build/buildpnor/buildBpmFlashImages.pl to do all the necessary
+ work to bring the flash image binaries up-to-date.
+* In <fsp_sandbox>/obj/ppc/engd/nvdimm/bpm/ are 16GB-NVDIMM-BPM-CONFIG.bin,
+16GB-NVDIMM-BPM-FW.bin, 32GB-NVDIMM-BPM-CONFIG.bin, and 32GB-NVDIMM-BPM-FW.bin
+ * These are the output binaries, which will be associated with LIDs for
+ hostboot use.
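+
+Because the BPM type and the version live only in the image file name, a
+renamed file cannot be classified. The following is a minimal sketch of that
+idea, not the logic used by buildBpmFlashImages.pl. The pattern is inferred
+from the two example names above, and `IMAGE_NAME`, `BPM_TYPE`, and
+`parse_image_name` are hypothetical names used only for illustration.
+
+```python
+import re
+
+# Assumption: the layout is inferred from the two example file names in this
+# ReadMe; the real build script may parse the name differently.
+IMAGE_NAME = re.compile(
+    r"^SRCA8062IBMH01(?P<kind>[12])B_FULL_FW_Rev(?P<version>\d+\.\d+)_.*\.txt$"
+)
+
+# Per this ReadMe: 1 means the 32GB type, 2 means the 16GB type.
+BPM_TYPE = {"1": "32GB", "2": "16GB"}
+
+
+def parse_image_name(name):
+    """Return (bpm_type, version), or raise ValueError for an unknown name."""
+    match = IMAGE_NAME.match(name)
+    if match is None:
+        raise ValueError("unrecognized BPM image file name: " + name)
+    return BPM_TYPE[match.group("kind")], match.group("version")
+```
+
+With the first example name above, `parse_image_name` yields the 16GB type and
+version 1.03. A name that has lost its prefix or its Rev field is rejected
+rather than guessed at.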
+
+### BPM Update Flow Overview
+The update procedure for the BPM is fairly rigid. There are many steps that must
+occur in a precise order; otherwise, the update will fail. We aren't able to
+communicate directly with the BPM for these updates. Instead, we send commands
+to the NVDIMM, which in turn passes them along to the BPM. Two "modes" must be
+enabled to begin the update process and communicate with the BPM. These are:
+
+##### Update Mode
+This is a mode for the NVDIMM. To enter this mode a command is sent to the
+NVDIMM so that the NVDIMM can do some house-keeping work to prepare for the BPM
+update. Since the NVDIMM is always doing background scans of the BPM, this mode
+will quiet those scans so that we are able to communicate with the BPM.
+Otherwise, the communication would be too chaotic to perform the update.
+
+##### Boot Strap Loader (BSL) Mode (Currently, only BSL 1.4 is supported)
+This is the mode that the BPM enters in order to perform the update. In order to
+execute many of the commands necessary to perform the update, the BPM **must**
+be in BSL mode. There are several versions of BSL mode, and these versions are
+not coupled with the firmware version at all. To update the BSL version on a
+BPM, the device must be shipped back to SMART because the update requires a
+specific hardware programmer device.
+
+The update procedure does vary between BSL versions, so to ensure a successful
+update the code will first read the BSL version on the BPM. If the BSL version
+is not 1.4 (the supported version), the update will not be attempted. BSL
+versions prior to 1.4 are known to be different enough that the update would
+fail, and it is unknown whether future BSL versions will be backward compatible
+with the BSL 1.4 procedure.
+
+If something happens to the firmware during an update such that the firmware on
+the device is missing or invalid, the BPM is designed to always fall back to
+this mode so that valid firmware can be loaded onto the BPM and the device can
+be recovered. However, if the firmware is corrupted by any means outside of an
+update then it is highly likely that the BPM will not be recoverable and it may
+need to be sent back to SMART for recovery.
+
+#### An update in two parts
+The BPM update cannot be done in one single pass. This is because there are two
+sections of data on the BPM that must be modified to successfully update the
+BPM. These are referred to as the Firmware portion of the update and the
+Configuration Data Segment portion of the update.
+
+##### The Firmware Portion
+This is the actual firmware update, although when someone says "the BPM
+Firmware Update" they are often implicitly referring to both parts of the
+update. For the full update to be a success, and for all of the features in a
+given update to be available, the firmware portion relies on the other part:
+the Configuration Data Segment. It is safe, and advisable, to update the
+firmware part first and the configuration part second.
+
+##### The Configuration Data Portion
+The Configuration Data Segment portion is commonly referred to as the segment
+update, config update, or any other variation of the name. The config segment
+portion **requires** working firmware on the BPM to succeed. This is because we
+must read out some of the segment data on the BPM and merge it with parts from
+the image. Without working firmware, it will not work and the update will
+_never_ succeed.
+
+The configuration data on the BPM is broken into four segments, A, B, C, and D.
+These are in reverse order in memory such that D has the lowest address offset.
+For our purposes, we only care about Segments D and B. A and C contain logging
+information and are not necessary to touch. Segment D will be completely
+replaced by the data in the image file. Segment B is the critical segment,
+however, because we must splice data from the image into it. Segment B contains
+statistical information and other valuable information that should never be lost
+during an update. If this segment becomes corrupted then it is very likely the
+BPM will be stuck in a bad state.
+
+##### Bpm::runUpdate Flow
+1. Read the current firmware version on the BPM to determine if updates are
+necessary. If this cannot be done, that is to say that an error occurs during
+this process, then updates will not be attempted due to a probable
+communication issue with the BPM.
+2. Read the current BSL mode version to determine if the BSL version on the BPM
+is compatible with the versions we support. If this cannot be done due to some
+kind of error, then the updates will not be attempted since we cannot be sure
+that the BPM has a compatible BSL version.
+3. Perform the firmware portion of the update. If an error occurs during this
+part of the update, then the segment portion of the update will not be
+attempted, because the segment update requires working firmware on the BPM.
+4. Perform the segment portion of the update.
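+
+The four steps above can be sketched as a short gating function. This is an
+illustration under stated assumptions, not the code in bpm_update.C: `BpmError`,
+the `bpm` object and its methods, and the returned strings are all
+hypothetical, and treating "versions differ" as "update necessary" is an
+assumption based on the statement that both upgrades and downgrades are
+supported.
+
+```python
+SUPPORTED_BSL_VERSION = "1.4"  # the only BSL version supported, per this ReadMe
+
+
+class BpmError(Exception):
+    """Hypothetical error type standing in for a failed BPM operation."""
+
+
+def run_update(bpm, image_version):
+    """Walk the four-step flow and return a short outcome string."""
+    try:
+        current = bpm.read_firmware_version()   # step 1
+    except BpmError:
+        return "skipped: cannot read firmware version"
+    if current == image_version:                # assumption: differ => update
+        return "skipped: already at image version"
+    try:
+        bsl = bpm.read_bsl_version()            # step 2
+    except BpmError:
+        return "skipped: cannot read BSL version"
+    if bsl != SUPPORTED_BSL_VERSION:
+        return "skipped: unsupported BSL version"
+    try:
+        bpm.update_firmware()                   # step 3
+    except BpmError:
+        # The segment update needs working firmware, so stop here.
+        return "failed: firmware update"
+    try:
+        bpm.update_config_segments()            # step 4
+    except BpmError:
+        return "failed: segment update"
+    return "updated"
+```
+
+The early returns mirror the rule that a failure in an earlier step prevents
+every later step from being attempted.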
+
+##### Common Operating Processes between functions
+Reading the BSL version, performing the firmware update, and performing the
+segment update all follow a common operating process to do their work
+successfully. The steps laid out in those functions must be followed in the
+given order; otherwise, the functions will not execute successfully and the BPM
+may go into a bad state. These steps are:
+1. Enter Update Mode
+2. Verify the NVDIMM is in Update Mode
+3. Command the BPM to enter BSL mode
+4. Unlock the BPM so that writing can be performed.
+5. Do function's work.
+6. Reset the BPM, which is the way that BSL mode is exited.
+7. Exit Update Mode
+
+By following these steps, the BPM is able to do some background work to verify
+its state. Attempting the firmware and config updates within a single pass
+introduces unpredictable behavior. That is, running steps 1-4 once, performing
+both the firmware update and the config update as step 5, and then running
+steps 6-7 once will produce unpredictable behavior. It is best to run through
+the whole process for each update. Reading the BSL version does not have this
+limitation. As long as steps 1-4 have been executed, the BSL version can be read
+at any time.
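+
+One way to picture the "whole process for each update" rule is a wrapper that
+owns steps 1-4 and 6-7 and is entered once per unit of work. The sketch below
+is illustrative only: the `dev` object and all of its method names are
+hypothetical, and running the reset and exit steps even when the work fails is
+an assumption, not documented behavior of bpm_update.C.
+
+```python
+from contextlib import contextmanager
+
+
+@contextmanager
+def bsl_session(dev):
+    """Wrap one unit of work in steps 1-4 and 6-7 of the list above."""
+    dev.enter_update_mode()                # step 1
+    try:
+        if not dev.in_update_mode():       # step 2
+            raise RuntimeError("NVDIMM did not enter update mode")
+        dev.enter_bsl_mode()               # step 3
+        try:
+            dev.unlock()                   # step 4
+            yield dev                      # step 5: the caller's work
+        finally:
+            dev.reset_bpm()                # step 6: reset exits BSL mode
+    finally:
+        dev.exit_update_mode()             # step 7
+
+
+def full_update(dev):
+    """Run the firmware and config updates in two separate sessions."""
+    with bsl_session(dev):
+        dev.update_firmware()
+    with bsl_session(dev):
+        dev.update_config_segments()
+```
+
+Putting both updates inside a single `with` block would be the single-pass
+pattern that the paragraph above warns against.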
+
+-------------------------------------------------------------------------------
+# Node Controller (NC) Update Overview
+To support different firmware versions released by SMART, the nvdimm_update.C
+and nvdimm_update.H files were created to facilitate upgrades and downgrades of
+the firmware version of node controllers for NVDIMM. There are two kinds of
+NVDIMM node controllers: one that supports 16GB type NVDIMMs and one that
+supports 32GB type NVDIMMs. Although they have separate image files, the update
+is functionally the same for each. This overview will not go into fine-grained
+detail on every process of the update. For more information, see the comments in
+nvdimm_update.H, nvdimm_update.C, and the various supporting files.
+
+Supporting Files:
+* Two signed image files are provided by SMART.
+ The name contains the NC type (16GB or 32GB) + the version (v##)
+
+ Example:
+ nvc4_fpga_31mm_X4_16GB_A7_2TLC_GA6_IBM_JEDEC_2019_03_22_v30_r29325-SIGNED.bin
+ nvc4_fpga_31mm_X4_32GB_A7_2TLC_GA6_IBM_JEDEC_2019_03_22_v30_r29325-SIGNED.bin
+
+* Files checked into the CMVC/build process
+ Note: Each file contains two bytes that describe the NC type and version, so
+ we can use a generic name in CMVC
+ * NVDIMM_SRN7A2G4IBM26MP1SC.bin (16GB one)
+ * NVDIMM_SRN7A4G4IBM24KP2SB.bin (32GB one)
+
+* Build process creates lid files that are loaded on system
+ * 80d00025.lid (secure content LID)
+ * 81e00640.lid (signed 16GB)
+ * 81e00641.lid (signed 32GB)
+
+### NC Update Flow Overview
+The update procedure for the NC is fairly rigid. There are many steps that must
+occur in a precise order; otherwise, the update will fail.
+
+### Design points
+Three classes are used for the NC update
+* NvdimmsUpdate -- container/driver class
+ This is where all the functional NVDIMM NCs are checked and updated if necessary
+* NvdimmLidImage -- accessors for a given NC LID image (16 or 32)
+ This provides the LID content for easy checking and use during update
+* NvdimmInstalledImage -- accessor to current installed NC image
+ This is the main workhorse. It uses i2c communication to check what is
+ installed and performs the update to a new LID image level
+
+##### NvdimmsUpdate::runUpdate Flow
+1. Build up installed NVDIMM image lists (determine what NC types are installed)
+2. Using the secure content LID, call runUpdateUsingLid() for each LID type
+with the appropriate target NVDIMMs associated with that type.
+3. runUpdateUsingLid() cycles through each NVDIMM target and checks if the
+current NC level is different than the LID version level.
+The update runs only if the levels do not match, which allows both upgrades and
+downgrades.
+4. NvdimmInstalledImage::updateImage() is called on each NVDIMM node controller
+that requires an update
+5. updateImage runs through the steps outlined in 9.7 Firmware Update workflow
+in the JEDEC document JESD245B
+6. Basic steps of the update done one NVDIMM controller at a time
+ 1. Validate module manufacturer ID and module product identifier (done before this)
+ 2. Verify 'Operation In Progress' bit in the NVDIMM_CMD_STATUS0
+ register is cleared (i.e., the NV controller is NOT busy)
+ 3. Make sure we start from a cleared state
+ 4. Enable firmware update mode
+ 5. Clear the Firmware Operation status
+ 6. Clear the firmware data block to ensure there is no residual data
+ 7. Send the first part (header + SMART signature) of the Firmware Image Data.
+ This includes sending the data and checking the checksum after the data is
+ sent
+ 8. Command the module to validate that the firmware image is valid for
+ the module based on the header
+ 9. Commit the first firmware data region
+ 10. Send and commit the remaining firmware data in regions of up to
+ REGION_BLOCK_SIZE blocks
+ - each block is 32 bytes
+ - each region contains up to REGION_BLOCK_SIZE blocks (currently 1,024)
+ - each region is verified by checksum before the next region is sent
+ 11. Command the module to validate the firmware data
+ 12. Disable firmware update mode
+ 13. Switch from slot0 to slot1 which contains the new image code
+ 14. Validate running new code level
+
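+The region and block sizes in step 10 above imply that one full region carries
+32 * 1,024 = 32,768 bytes. The sketch below only shows that arithmetic by
+splitting a byte string into regions and blocks. It is not the code in
+nvdimm_update.C: `split_regions` is a hypothetical helper, the checksum and
+commit steps are left out, and whether a short final block is padded is not
+stated here, so the sketch leaves it short.
+
+```python
+BLOCK_SIZE = 32            # bytes per block, per this ReadMe
+REGION_BLOCK_SIZE = 1024   # maximum blocks per region, per this ReadMe
+REGION_BYTES = BLOCK_SIZE * REGION_BLOCK_SIZE   # 32768 bytes per full region
+
+
+def split_regions(data):
+    """Yield one list of blocks per region, in the order they would be sent."""
+    for start in range(0, len(data), REGION_BYTES):
+        region = data[start:start + REGION_BYTES]
+        yield [region[i:i + BLOCK_SIZE]
+               for i in range(0, len(region), BLOCK_SIZE)]
+```
+
+For example, 70,000 bytes of remaining image data would be split into two
+full regions of 1,024 blocks followed by a third region of 140 blocks, the
+last of which holds only 16 bytes.
+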
+# NVDIMM Secure Erase Verify Flow
+DS8K lpar -> HBRT NVDIMM operation = factory_default + secure_erase_verify_start
+ HBRT executes factory_default and steps 1) and 2)
+DS8K lpar -> HBRT NVDIMM operation = secure_erase_verify_complete
+ HBRT executes step 3)
+ If secure erase verify has not completed, return status with verify_complete bit = 0
+ DS8K lpar is responsible for monitoring the elapsed time (2/4 hours) and restarting the process (step 6)
+ If secure erase verify has completed
+ HBRT executes steps 4) and 5), generating error logs for any non-zero register values
+ Return status with verify_complete bit = 1
+
+## Procedure Flow for NVDIMM Secure Erase Verify
+ *Note: Secure Erase Verify should only be run after a Factory Default operation.
+ Secure Erase Verify is intended to verify whether all NAND blocks have been erased.
+ *Note: Full breakout of all Page 5 Secure Erase Verify registers can be found in
+ SMART document "JEDEC NVDIMM Vendor Page 2 Extensions".
+ 1) Set Page 5 Register 0x1B to value "0x00"
+ // this clears the status register
+ 2) Set Page 5 Register 0x1A to value "0xC0"
+ // this kicks off the erase verify operation
+ 3) Wait for Page 5 Register 0x1A Bit 7 to be reset to value "0"
+ // i.e., the overall register value should be "0x40";
+ this means that erase verify has completed
+ a. If Page 5 Register 0x1A Bit 7 has not reset to value "0"
+ after 2 hours (16GB NVDIMM) or after 4 hours (32GB NVDIMM),
+ report a timeout error and skip to step (6)
+ 4) Read Page 5 Register 0x1B; value should be "0x00"
+ // this is the erase verify status register
+ a. If Page 5 Register 0x1B value is not "0x00",
+ report any/all errors as outlined in the table at the end of this document,
+ then skip to step (6)
+ 5) Read Page 5 Registers 0x1D (MSB) and 0x1C (LSB);
+ combined, the two registers should have a value of "0x0000"
+ // this is the number of chunks failing Secure Erase Verify
+ a. If the combined value of the two registers is not "0x0000",
+ report a threshold exceeded error along with the combined value of the two registers,
+ then skip to step (6)
+ 6) If any errors have been reported in steps (3), (4), or (5),
+ retry the secure erase verify operation starting again from step (1)
+ a. If the secure erase verify operation fails even after retrying,
+ report that secure erase verify operation has failed
+ 7) If no errors have been reported, report that secure erase verify operation
+ has been completed successfully
+ *Addendum: Breakout of Page 5 Register 0x1B Erase Verify Status bit values referenced in step (4) above.
+ All these bits should return as "0". Any bits returning as "1" should be reported with the error name below.
+ Bits 7:6 - Reserved
+ Bit 5 - BAD BLOCK
+ Bit 4 - OTHER
+ Bit 3 - ENCRYPTION LOCKED
+ Bit 2 - INVALID PARAMETER
+ Bit 1 - INTERRUPTED
+ Bit 0 - NAND ERROR
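+
+The register checks in steps (3), (4), and (5) reduce to a few bit operations.
+The sketch below decodes them from raw register values. It is illustrative
+only: the function names are hypothetical, no register access is shown, and
+the bit names are copied from the addendum above.
+
+```python
+# Bit position -> error name for Page 5 Register 0x1B (bits 7:6 are reserved).
+ERASE_VERIFY_STATUS_BITS = {
+    5: "BAD BLOCK",
+    4: "OTHER",
+    3: "ENCRYPTION LOCKED",
+    2: "INVALID PARAMETER",
+    1: "INTERRUPTED",
+    0: "NAND ERROR",
+}
+
+
+def verify_complete(reg_1a):
+    """Step (3): bit 7 of Register 0x1A reads 0 once erase verify is done."""
+    return (reg_1a & 0x80) == 0
+
+
+def decode_status(reg_1b):
+    """Step (4): names of the error bits set in Register 0x1B."""
+    return [name for bit, name in ERASE_VERIFY_STATUS_BITS.items()
+            if reg_1b & (1 << bit)]
+
+
+def failing_chunks(reg_1d, reg_1c):
+    """Step (5): combine 0x1D (MSB) and 0x1C (LSB) into one 16-bit count."""
+    return (reg_1d << 8) | reg_1c
+```
+
+An empty list from `decode_status` and a zero from `failing_chunks` correspond
+to the success path in step (7).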
+