| author | Andrew Geissler <andrewg@us.ibm.com> | 2017-02-03 16:19:14 -0600 |
|---|---|---|
| committer | Patrick Williams <patrick@stwcx.xyz> | 2017-02-14 20:48:06 +0000 |
| commit | a34160bea04508d8ff403c2b0fb59d4568bf52d9 (patch) | |
| tree | 77ff889c92ba11b7a21223e127e481585809be81 | |
| parent | 37a5d0af3cd5bfc7dc97d367e61a850e962f87c9 (diff) | |
Define openbmc systemd target and service error handling
Change-Id: Icc83ca78b7b485374d3a922fe5372c71390b3b67
Signed-off-by: Andrew Geissler <andrewg@us.ibm.com>
| -rw-r--r-- | openbmc-systemd.md | 68 |
1 files changed, 68 insertions, 0 deletions
diff --git a/openbmc-systemd.md b/openbmc-systemd.md
index d52dae8..1a91ce7 100644
--- a/openbmc-systemd.md
+++ b/openbmc-systemd.md
@@ -82,3 +82,71 @@ xyz.openbmc_project.State.Host RequestedHostTransition s xyz.openbmc_project.State.Host.Transition.On
 Underneath the covers, this is calling systemd with obmc-chassis-start@0.target
+
+## Error Handling of Systemd
+With great numbers of targets and services, come great chances for failures.
+To make OpenBMC a robust and productive system, it needs to be sure to have an
+error handling policy for when services and their targets fail.
+
+When a failure occurs, the OpenBMC software needs to notify the users of the
+system and provide mechanisms for either the system to automatically retry the
+failed operation (i.e. reboot the system) or to stay in a quiesced state so that
+error data can be collected and the failure can be investigated.
+
+There are two main failure scenarios when it comes to OpenBMC and systemd usage:
+
+1. A service within a target fails
+- If the service is a "oneshot" type, and the service is required
+(not wanted) by the target then the target will fail if the service
+fails
+    - Define a behavior for when the target fails using the
+      "OnFailure" option (i.e. go to a new failure target if any required
+      service fails)
+- If the service is not a "oneshot", then it can not fail the target
+(the target only knows that it started successfully)
+    - Define a behavior for when the service fails (OnFailure)
+      option.
+    - The service can not have "RemainAfterExit=yes" otherwise, the OnFailure
+      action does not occur until the service is stopped (instead of when it
+      fails)
+    - *See more information below on [RemainAfterExit](#RemainAfterExit)
+
+2. A failure outside of a normal systemd target/service (host watchdog expires,
+host checkstop detected)
+- The service which detects this failure is responsible for logging the
+appropriate error, and instructing systemd to go to the appropriate target
+
+Within OpenBMC, there is a host quiesce target. This is the target that other
+host related targets should go to when they hit a failure. Other software within
+OpenBMC can then monitor for the entry into this quiesce target and will handle
+the halt vs. automatic reboot functionality.
+
+Targets which are not host related, will need special thought in regards to
+their error handling. For example, the target responsible for applying chassis
+power, obmc-power-chassis-on@0.target, will have a
+"OnFailure=obmc-power-chassis-off@%i.target" error path. That is, if the
+chassis power on target fails then power off the chassis.
+
+The above info sets up some general **guidelines** for our host related
+targets and services:
+
+- All targets should have an "OnFailure=obmc-quiesce-host@.target"
+- All services which are required for a target to achieve its function should
+be RequiredBy that target (not WantedBy)
+- All services should first try to be "Type=oneshot" so that we can just rely on
+the target failure path
+- If a service can not be "Type=oneshot", then it needs to have a
+"OnFailure=obmc-quiesce-host@.target" and ideally set "RemainAfterExit=no"
+(but see caveats on this below)
+- If a service can not be any of these then it's up to the service application
+to call systemd with the obmc-quiesce-host@.target on failures
+
+### RemainAfterExit
+This is set to "yes" for most OpenBMC services to handle the situation where
+someone starts the same target twice. If the associated service with that
+target is not running (i.e. RemainAfterExit=no), then the service will be
+executed again. Think about someone accidentally running the
+obmc-chassis-start@.target twice. If you execute it when the operating system
+is up and running, and the service which toggles the pgood pin is re-executed,
+you're going to crash your system. Given this info, the goal should always be
+to write "oneshot" services that have RemainAfterExit set to yes.
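
For readers of this change, the guidelines it adds for host related targets and services could be sketched as the pair of unit files below. This is only an illustration and is not part of the commit. The unit names `obmc-example-host@.target` and `example-host-step@.service` and the binary path `/usr/bin/example-host-step` are hypothetical. Writing the quiesce target with `%i` is an assumption made by analogy with the chassis power example in the text.

```ini
# --- obmc-example-host@.target (hypothetical target name) ---
[Unit]
Description=Example host target for instance %i
# If any required oneshot service fails, the target fails and
# systemd starts the host quiesce target instead.
OnFailure=obmc-quiesce-host@%i.target

# --- example-host-step@.service (hypothetical service name) ---
[Unit]
Description=Example step required by the host target

[Service]
# A oneshot service lets the target's failure path handle errors.
Type=oneshot
# Stay "active" after exit so that starting the target a second
# time does not re-run this step.
RemainAfterExit=yes
# Hypothetical binary, shown only to make the unit complete.
ExecStart=/usr/bin/example-host-step

[Install]
# RequiredBy (not WantedBy), so a failure here fails the target.
RequiredBy=obmc-example-host@%i.target
```

A service that cannot be `Type=oneshot` would, following the same guidelines, carry its own `OnFailure=` line in its `[Unit]` section rather than relying on the target's failure path.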

