path: root/core/hmi.c
Commit message | Author | Age | Files | Lines
* opal/hmi: Initialize the hmi event with old value of TFMR.Mahesh Salgaonkar2019-04-171-1/+3
  Do this before we fix TFAC errors. Otherwise the event at the host console shows no thread error reported in the TFMR register.

  Without this patch the console event shows TFMR with no thread error:
  (DEC parity error TFMR[59] injection)

  [ 53.737572] Severe Hypervisor Maintenance interrupt [Recovered]
  [ 53.737596] Error detail: Timer facility experienced an error
  [ 53.737611] HMER: 0840000000000000
  [ 53.737621] TFMR: 3212000870e04000

  After this patch it shows the old TFMR value on the host console:

  [ 2302.267271] Severe Hypervisor Maintenance interrupt [Recovered]
  [ 2302.267305] Error detail: Timer facility experienced an error
  [ 2302.267320] HMER: 0840000000000000
  [ 2302.267330] TFMR: 3212000870e14010

  Fixes: 674f7696f ("opal/hmi: Rework HMI handling of TFAC errors")
  Cc: skiboot-stable@lists.ozlabs.org
  Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
  Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
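The two TFMR values quoted in this commit message differ in exactly two bits. A minimal sketch of how to list them, assuming IBM bit numbering (bit 0 is the most significant bit of the 64-bit value); `ibm_bits_set` and `tfmr_diff` are hypothetical helpers written for this illustration, not skiboot functions:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: store the IBM-numbered position (bit 0 = MSB)
 * of every set bit in val into out[], in ascending order, and return
 * how many were found. out must have room for 64 entries.
 */
static size_t ibm_bits_set(uint64_t val, int out[64])
{
	size_t n = 0;

	for (int bit = 0; bit < 64; bit++) {
		if (val & (1ULL << (63 - bit)))
			out[n++] = bit;
	}
	return n;
}

/* XOR of the two TFMR values logged with and without the patch. */
static uint64_t tfmr_diff(void)
{
	return 0x3212000870e14010ULL ^ 0x3212000870e04000ULL;
}
```

The difference comes out as bits 47 and 59. Bit 59 matches the injected DEC parity error named in the commit message; the sketch makes no claim about what bit 47 encodes.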
* opal/hmi: Never trust a cow!Frederic Barrat2019-04-091-1/+1
| | | | | | | | | | With opencapi, it's fairly common to trigger HMIs during AFU development on the FPGA, by not replying in time to an NPU command, for example. So shift the blame reported by that cow to avoid crowding my mailbox. Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* hw/npu2: Dump (more) npu2 registers on link error and HMIsFrederic Barrat2019-04-091-57/+1
| | | | | | | | | | | | | | | | | | We were already logging some NPU registers during an HMI. This patch cleans up a bit how it is done and separates what is global from what is specific to nvlink or opencapi. Since we can now receive an error interrupt when an opencapi link goes down unexpectedly, we also dump the NPU state but we limit it to the registers of the brick which hit the error. The list of registers to dump was worked out with the hw team to allow for proper debugging. For each register, we print the name as found in the NPU workbook, the scom address and the register value. Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: set a flag to inform OS that TOD/TB has failed.Mahesh Salgaonkar2019-03-051-1/+3
  Set a flag to inform the OS of a TOD/TB failure as part of the new opal_handle_hmi2 handler. The OS can then use this flag to make sure that functions depending on the TB value (e.g. udelay()) are aware that the TB is not ticking.

  Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
  Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Fix double unlock of hmi lock in failure path.Mahesh Salgaonkar2019-03-051-5/+1
  Unlock once and goto error_out.

  Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
  Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Wakeup the cpu before reading core_firVaibhav Jain2018-10-161-6/+15
  When stop state 5 is enabled, reading the core_fir during an HMI can result in an xscom read error, with xscom_read() returning an OPAL_XSCOM_PARTIAL_GOOD error code and a core_fir value of all FFs. At present this error code is not handled in decode_core_fir(), so the invalid core_fir value is sent to the kernel, which interprets it as a FATAL hmi and causes a system check-stop.

  This can be prevented by forcing the core to wake up, using a special wakeup, before reading the core_fir. Hence this patch wraps the call to read_core_fir() within calls to dctl_set_special_wakeup() and dctl_clear_special_wakeup().

  Suggested-by: Michael Neuling <mikey@neuling.org>
  Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
  Signed-off-by: Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com>
  Acked-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
  Reviewed-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
  Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Handle early HMIs on thread0 when secondaries are still in OPAL.Mahesh Salgaonkar2018-09-271-0/+49
  When the primary thread receives a CORE level HMI for timer facility errors while secondaries are still in OPAL, thread 0 ends up in rendez-vous waiting for the secondaries to get into hmi handling. This is because OPAL runs with MSR(EE=0), and hence HMIs are delayed on secondary threads until they are given to the Linux OS. Fix this by adding a check for secondary state and forcing them into hmi handling by queuing a job on the secondary threads.

  I have tested this by injecting an HDEC parity error very early during Linux kernel boot. Recovery works fine for non-TB errors. But if TB is bad at this very early stage we are already doomed.

  Without this patch we see:

  [ 285.046347408,7] OPAL: Start CPU 0x0843 (PIR 0x0843) -> 0x000000000000a83c
  [ 285.051160609,7] OPAL: Start CPU 0x0844 (PIR 0x0844) -> 0x000000000000a83c
  [ 285.055359021,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
  [ 285.055361439,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:0: TFMR(2e12002870e14000) Timer Facility Error
  [ 286.232183823,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc1)
  [ 287.409002056,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc1)
  [ 289.073820164,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc1)
  [ 290.250638683,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc2)
  [ 291.427456821,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc2)
  [ 293.092274807,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc2)
  [ 294.269092904,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc3)
  [ 295.445910944,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc3)
  [ 297.110728970,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc3)

  After this patch:

  [ 259.401719351,7] OPAL: Start CPU 0x0841 (PIR 0x0841) -> 0x000000000000a83c
  [ 259.406259572,7] OPAL: Start CPU 0x0842 (PIR 0x0842) -> 0x000000000000a83c
  [ 259.410615534,7] OPAL: Start CPU 0x0843 (PIR 0x0843) -> 0x000000000000a83c
  [ 259.415444519,7] OPAL: Start CPU 0x0844 (PIR 0x0844) -> 0x000000000000a83c
  [ 259.419641401,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
  [ 259.419644124,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:0: TFMR(2e12002870e04000) Timer Facility Error
  [ 259.419650678,7] HMI: Sending hmi job to thread 1
  [ 259.419652744,7] HMI: Sending hmi job to thread 2
  [ 259.419653051,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
  [ 259.419654725,7] HMI: Sending hmi job to thread 3
  [ 259.419654916,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
  [ 259.419658025,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
  [ 259.419658406,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:2: TFMR(2e12002870e04000) Timer Facility Error
  [ 259.419663095,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:3: TFMR(2e12002870e04000) Timer Facility Error
  [ 259.419655234,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:1: TFMR(2e12002870e04000) Timer Facility Error
  [ 259.425109779,7] OPAL: Start CPU 0x0845 (PIR 0x0845) -> 0x000000000000a83c
  [ 259.429870681,7] OPAL: Start CPU 0x0846 (PIR 0x0846) -> 0x000000000000a83c
  [ 259.434549250,7] OPAL: Start CPU 0x0847 (PIR 0x0847) -> 0x000000000000a83c

  Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
  Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Ignore debug trigger inject core FIR.Mahesh Salgaonkar2018-09-171-1/+0
| | | | | | | | | | | | | Core FIR[60] is a side effect of the work around for the CI Vector Load issue in DD2.1. Usually this gets delivered as HMI with HMER[17] where Linux already ignores it. But it looks like in some cases we may happen to see CORE_FIR[60] while we are already in Malfunction Alert HMI (HMER[0]) due to other reasons e.g. CAPI recovery or NPU xstop. If that happens then just ignore it instead of crashing kernel as not recoverable. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Catch NPU2 HMIs for opencapiFrederic Barrat2018-08-131-5/+10
| | | | | | | | | HMIs for NPU2 are filtered with the 'compatible' string of the PHB, so add opencapi to the mix. Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* hw/npu2, core/hmi: Use NPU instead of NPU2 as log message prefixAndrew Donnellan2018-06-271-3/+3
| | | | | | | | | | | | | | | | The NPU2{DBG,INF,ERR} macros use "NPU%d" as a prefix to identify messages relating to a particular NPU. It's slightly confusing to have per-NPU messages prefixed with "NPU0" or "NPU1" and NPU-generic messages prefixed with "NPU2". On some future system we could potentially have a NPU #2 in which case it'd be really confusing. Use NPU rather than NPU2 for NPU-generic log messages. There's no risk of confusion with the original npu.c code since that's only for P8. Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Reza Arbab <arbab@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Display correct chip id while printing NPU FIRs.Mahesh Salgaonkar2018-06-041-4/+4
  HMIs for NPU xstops are broadcast to all chips, so all cores on all chips receive the HMI. The HMI handler correctly identifies the affected chip and extracts its NPU FIR details, but when printing the FIR data it prints the chip id and location code of this_cpu()->chip_id, which may not be the affected chip. This patch fixes this issue.

  CC: stable # v6.0+
  Fixes: 7bcbc78c ("Add location code to NPU2 HMI logging")
  Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
  [stewart: add fixes and cc stable]
  Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* Add location code to NPU2 HMI loggingBalbir Singh2018-05-151-8/+14
  The current HMI error message does not specify where the HMI error occurred.

  The original error message was

  NPU: FIR#0 FIR 0x0080100000000000 mask 0x009a48180f01ffff

  The enhanced error message is

  NPU2: [Loc: UOPWR.0000000-Node0-Proc0] P:0 FIR#0 FIR 0x0000100000000000 mask 0x009a48180f03ffff

  Signed-off-by: Balbir Singh <bsingharora@gmail.com>
  Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* hmi: Fix clearing HMER on debug triggerMichael Neuling2018-05-091-0/+1
| | | | | | | | | | | | | | | In the recent patch: eddff9bf40 hmi: Clear unknown debug trigger I rebased the code from an older skiboot before the HMI rework. When I did this, I missed the handled flag. Without this the HMER is not cleared properly and the HMI keeps happening. This properly sets the handled flag and hence clears the HMER bit. Signed-off-by: Michael Neuling <mikey@neuling.org> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* hmi: Clear unknown debug triggerRyan Grimm2018-05-041-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | On some systems, seeing hangs like this when Linux starts: [ 170.027252763,5] OCC: All Chip Rdy after 0 ms [ 170.062930145,5] INIT: Starting kernel at 0x20011000, fdt at 0x30ae0530 366247 bytes) [ 171.238270428,5] OPAL: Switch to little-endian OS If you look at the in memory skiboot console (or do 'nvram -p ibm,skiboot --update-config log-level-driver=7') we see the console get spammed with: [ 5209.109790675,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000 [ 5209.109792716,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000 [ 5209.109794695,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000 [ 5209.109796689,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000 We're taking the debug trigger (bit 17) early on, before the hmi_debug_trigger function in the kernel is set up. This clears the HMI in Skiboot and reports to the kernel instead of bringing down the machine. Signed-off-by: Ryan Grimm <grimm@linux.vnet.ibm.com> Signed-off-by: Michael Neuling <mikey@neuling.org> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* core/hmi: assign flags=0 in case nothing set by handle_hmi_exceptionStewart Smith2018-05-041-1/+1
| | | | | | | | | | Practically speaking, I don't think you'd *currently* hit this. Found with Clang's scan-build. Signed-off-by: Stewart Smith <stewart@linux.ibm.com> Acked-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Generate one event per core for processor recovery.Mahesh Salgaonkar2018-04-241-3/+3
  Processor recovery is a per-core error. All threads on that core receive the HMI, but they do not all need to generate an HMI event for the same error. Let only thread 0 generate the event.

  Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
  Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal:hmi: Add missing processor recovery reason string.Mahesh Salgaonkar2018-04-241-0/+1
| | | | | | | | | | | With this patch now we see reason string printed for CORE_WOF[43] bit. [ 477.352234986,7] HMI: [Loc: U78D3.001.WZS004A-P1-C48]: P:8 C:22 T:3: Processor recovery occurred. [ 477.352240742,7] HMI: Core WOF = 0x0000000000100000 recovered error: [ 477.352242181,7] HMI: PC - Thread hang recovery Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Generate hmi event for recovered HDEC parity error.Mahesh Salgaonkar2018-04-171-3/+2
| | | | | Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: check thread 0 tfmr to validate latched tfmr errors.Mahesh Salgaonkar2018-04-171-19/+42
| | | | | | | | | | | Due to P9 errata, HDEC parity and TB residue errors are latched for non-zero threads 1-3 even if they are cleared. But these are not latched on thread 0. Hence, use xscom SCOMC/SCOMD to read thread 0 tfmr value and ignore them on non-zero threads if they are not present on thread 0. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Print additional debug information in rendezvous.Mahesh Salgaonkar2018-04-171-2/+4
| | | | | | | Helps in debugging... Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Fix handling of TFMR parity/corrupt error.Mahesh Salgaonkar2018-04-171-5/+4
  While testing TFMR parity/corrupt errors it has been observed that HMIs are delivered twice for this error:
  - The first time, the HMI is delivered with HMER[4,5]=1 and TFMR[60]=1.
  - The second time, the HMI is delivered with HMER[4,5]=1 and TFMR[60]=0 with a valid TB.

  On the second HMI we end up throwing the error message below even though TB is in a valid state.

  "HMI: TB invalid without core error reported"

  This patch fixes this issue by ignoring HMER[5] and checking only for TFMR[60] before setting this_cpu()->tb_invalid to true.

  Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
  Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Fix soft lockups during TOD errorsMahesh Salgaonkar2018-04-171-1/+15
  Some TOD errors do not affect the working of TOD and TB, which stay in a valid state. Hence we don't need rendez-vous for TOD errors that do not affect TB.

  TOD errors that affect TOD/TB report a global error on TFMR[44] along with bit 51, and they go down the rendez-vous path as expected.

  But TOD errors that do not affect TB set only TFMR bit 51. TFMR bit 51 is cleared when any single thread clears the TOD error. Once cleared, bit 51 is reflected to all the cores on that chip. Any thread that reads the TFMR register after the error is cleared will see TFMR bit 51 reset. Hence the threads that see TFMR[51]=1 fall into the rendez-vous path, and the threads that see TFMR[51]=0 return without doing anything. This ends up in soft lockups in the host kernel.

  This patch fixes this issue by not treating the TOD interrupt (TFMR[51]) as a core-global error, and hence avoiding the rendez-vous path completely. Instead, threads that see TFMR[51]=1 now take a different path that just does the TOD error recovery.

  Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
  Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
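The path selection described in this commit message can be sketched as follows. This is only an illustration built from the two bit positions the message gives (TFMR[44] and TFMR[51], IBM bit numbering); the `EX_`/`ex_` names are hypothetical, and the real handler considers more error bits than these two:

```c
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit. */
#define PPC_BIT(n) (1ULL << (63 - (n)))

/* Bit positions come from the commit message; the names are made up. */
#define EX_TFMR_CORE_GLOBAL_ERR	PPC_BIT(44)
#define EX_TFMR_TOD_INTERRUPT	PPC_BIT(51)

enum ex_tfac_path {
	EX_PATH_NONE,		/* nothing left to do on this thread */
	EX_PATH_TOD_ONLY,	/* TOD recovery without rendez-vous */
	EX_PATH_RENDEZVOUS,	/* core-global error, all threads sync */
};

/* Pick a recovery path from this thread's view of TFMR. */
static enum ex_tfac_path ex_pick_path(uint64_t tfmr)
{
	if (tfmr & EX_TFMR_CORE_GLOBAL_ERR)
		return EX_PATH_RENDEZVOUS;
	if (tfmr & EX_TFMR_TOD_INTERRUPT)
		return EX_PATH_TOD_ONLY;
	return EX_PATH_NONE;
}
```

Because bit 51 alone never selects the rendez-vous path, threads that read TFMR before and after another thread clears the TOD error can no longer end up waiting on each other.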
* opal/hmi: Do not send HMI event if no errors are found.Mahesh Salgaonkar2018-04-171-8/+13
  For TOD errors, all the cores in the chip get HMIs. Any one thread from any core can fix the issue, after which TFMR will have the error conditions cleared. The rest of the threads need not take any action if the TOD errors are already cleared. Hence thread 0 of every core should get a fresh copy of TFMR before going down the recovery path. Initialize recover = -1, so that if no errors are found the thread does not send an HMI event to Linux. This stops every thread from flooding the host with hmi events even when no errors are found.

  Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
  Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
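A minimal sketch of the "recover = -1" convention described above, assuming three states: -1 for "no error found", 0 for "recovery failed" and 1 for "recovered". The function names and the meaning assigned to 0 and 1 are assumptions made for this illustration, not taken from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model: start from -1 and only move to 0 or 1 when this
 * thread actually saw an error in its fresh copy of the register.
 */
static int ex_recover_status(uint64_t pending_errors, bool fixed)
{
	int recover = -1;	/* nothing found yet */

	if (pending_errors)
		recover = fixed ? 1 : 0;
	return recover;
}

/* An event is only worth sending if an error was really handled. */
static bool ex_should_send_event(int recover)
{
	return recover != -1;
}
```

A thread that finds the errors already cleared by another thread keeps the initial -1 and stays silent.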
* opal/hmi: Initialize the hmi event with old value of HMER.Mahesh Salgaonkar2018-04-171-3/+6
  Do this before we check for TFAC errors. Otherwise the event at the host console shows no error reported in the HMER register.

  Without this patch the console event shows HMER with all zeros:

  [ 216.753417] Severe Hypervisor Maintenance interrupt [Recovered]
  [ 216.753498] Error detail: Timer facility experienced an error
  [ 216.753509] HMER: 0000000000000000
  [ 216.753518] TFMR: 3c12000870e04000

  After this patch it shows the old HMER value on the host console:

  [ 2237.652533] Severe Hypervisor Maintenance interrupt [Recovered]
  [ 2237.652651] Error detail: Timer facility experienced an error
  [ 2237.652766] HMER: 0840000000000000
  [ 2237.652837] TFMR: 3c12000870e04000

  Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
  Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Rework HMI handling of TFAC errorsBenjamin Herrenschmidt2018-04-171-292/+231
  This patch reworks the HMI handling for TFAC errors by introducing 4 rendez-vous points to improve thread synchronization while handling timebase errors that require all threads to clear dirty data from the TB/HDEC register before clearing the errors.

  Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
  Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Don't bother passing HMER to pre-recovery cleanupBenjamin Herrenschmidt2018-04-171-14/+6
| | | | | | | | The test for TFAC error is now redundant so we remove it and remove the HMER argument. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Move timer related error handling to a separate functionBenjamin Herrenschmidt2018-04-171-48/+58
| | | | | | | | Currently no functional change. This is a first step to completely rewriting how these things are handled. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Add a new opal_handle_hmi2 that returns direct info to LinuxBenjamin Herrenschmidt2018-04-171-45/+82
| | | | | | | | | It returns a 64-bit flags mask currently set to provide info about which timer facilities were lost, and whether an event was generated. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Remove races in clearing HMERBenjamin Herrenschmidt2018-04-171-10/+12
| | | | | | | | | | | | Writing to HMER acts as an "AND". The current code writes back the value we originally read with the bits we handled cleared. This is racy, if a new bit gets set in HW after the original read, we'll end up clearing it without handling it. Instead, use an all 1's mask with only the bit handled cleared. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
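The race described above can be reproduced with a small software model. This is a sketch only: `ex_hmer` is a plain variable standing in for the register, `ex_write_hmer` models the "write acts as an AND" behaviour stated in the commit message, and all `ex_` names are hypothetical:

```c
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit. */
#define PPC_BIT(n) (1ULL << (63 - (n)))

/* Software stand-in for the HMER register. */
static uint64_t ex_hmer;

/* Model of the hardware: a write ANDs the value into the register. */
static void ex_write_hmer(uint64_t val)
{
	ex_hmer &= val;
}

/* Old approach: write back the earlier snapshot minus handled bits. */
static void ex_clear_racy(uint64_t snapshot, uint64_t handled)
{
	ex_write_hmer(snapshot & ~handled);
}

/* New approach: all 1's with only the handled bits cleared. */
static void ex_clear_safe(uint64_t handled)
{
	ex_write_hmer(~handled);
}
```

If a new bit is set after the snapshot is taken, `ex_clear_racy` writes a 0 in that position and wipes it, while `ex_clear_safe` writes a 1 there and leaves it pending for the next pass.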
* opal/hmi: Don't re-read HMER multiple timesBenjamin Herrenschmidt2018-04-171-21/+14
| | | | | | | | | We want to make sure all reporting and actions are based upon the same snapshot of HMER in case bits get added by HW while we are in OPAL. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* NPU2: dump NPU2 registers on npu2 HMIStewart Smith2018-03-271-2/+73
| | | | | | | | | | | Due to the nature of debugging npu2 issues, folk are wanting the full list of NPU2 registers dumped when there's a problem. We have to list out each register as traversing the range triggers FIR bits that confuse PRD. Suggested-by: Ryan Black <rblack@us.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
* Revert "NPU2 HMIs: dump out a *LOT* of npu2 registers for debugging"Stewart Smith2018-03-271-37/+1
  This reverts commit fbdc91e693fc3103f7e2a65054ed32bfb26a2e17.

  We don't need this as we need to do it a different way, with an explicit set of registers, as otherwise we trip other random FIR bits and everything becomes even more terrible. I suggest alcohol.

  Cc: stable
  Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
* core/hmi: report processor recovery reason from core FIR bits on P9Nicholas Piggin2018-03-011-3/+59
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an error is encountered that causes processor recovery, HMI is generated if the recovery was successful. The reason is recorded in the core FIR, which gets copied into the WOF. In this case dump the WOF register and an error string into the OPAL msglog. A broken init setting led to HMIs reported in Linux as: [ 3.591547] Harmless Hypervisor Maintenance interrupt [Recovered] [ 3.591648] Error detail: Processor Recovery done [ 3.591714] HMER: 2040000000000000 This patch would have been useful because it tells us exactly that the problem is in the d-side ERAT: [ 414.489690798,7] HMI: Received HMI interrupt: HMER = 0x2040000000000000 [ 414.489693339,7] HMI: [Loc: UOPWR.0000000-Node0-Proc0]: P:0 C:1 T:1: Processor recovery occurred. [ 414.489699837,7] HMI: Core WOF = 0x0000000410000000 recovered error: [ 414.489701543,7] HMI: LSU - SRAM (DCACHE parity, etc) [ 414.489702341,7] HMI: LSU - ERAT multi hit In future it will be good to unify this reporting, so Linux could print something more useful. Until then, this gives some good data. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
* NPU2 HMIs: dump out a *LOT* of npu2 registers for debuggingStewart Smith2018-02-281-1/+37
| | | | | | | | | | This is not the way we want to end up doing this. This is a hack to make folk happy and not require crondump to debug nvidia/npu2 issues. Cc: stable Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
* hw/npu2: Implement logging HMI actionsBalbir Singh2018-02-081-1/+82
| | | | | | | | | Log HMI errors as step 1. OS will need to deduce and interpret the HMI event. Signed-off-by: Balbir Singh <bsingharora@gmail.com> Acked-by: Alistair Popple <alistair@popple.id.au> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
* core/hmi: Display chip location code while displaying core FIR.Mahesh Salgaonkar2017-12-131-1/+4
| | | | | | | No functionality change. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
* core/hmi: Do not display FIR details if none of the bits are set.Mahesh Salgaonkar2017-12-131-0/+3
| | | | | | | | So that we don't flood OPAL console logs with information that is not useful. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
* opal/hmi: HMI logging with location code info.Mahesh Salgaonkar2017-12-131-4/+36
  Add a few HMI debug prints with location code info and a few additional details. No functionality change.

  With this patch the log messages will look like:

  [210612.175196744,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
  [210612.175200449,7] HMI: [Loc: UOPWR.1302LFA-Node0-Proc1]: P:8 C:16 T:1: TFMR(2d12000870e04020) Timer Facility Error
  [210660.259689526,7] HMI: Received HMI interrupt: HMER = 0x2040000000000000
  [210660.259695649,7] HMI: [Loc: UOPWR.1302LFA-Node0-Proc0]: P:0 C:16 T:1: Processor recovery Done.

  Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
  Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
* core/hmi: Use pr_fmt macro for tagging log messagesMahesh Salgaonkar2017-12-131-16/+19
| | | | | | | | No functionality changes. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
* opal/hmi: Workaround Power9 hw logic bug for couple of TFMR TB errors.Mahesh Salgaonkar2017-10-231-1/+27
  Add a workaround for a HW logic bug in Power9 where TB residue and HDEC parity errors cleared by one thread aren't visible to other threads of the same core.

  The TB residue and HDEC parity errors are reported through TFMR bits 45 and 26 respectively. If any thread of the core clears TFMR bits 26 and 45, only thread 0 is able to see that the errors are cleared; threads 1, 2 and 3 do not see them as cleared. This causes TB error recovery to fail for TB residue and HDEC parity errors. TFMR is a per-core register, and any change made by one thread should be visible to the other threads of the same core.

  On a TB residue error (TFMR bit 45), TB goes into an invalid state. Hence avoid handling/clearing the TB residue error if TB is valid and running. Use TFMR bit 41 to check the validity of the TB state.

  For the HDEC parity error (TFMR bit 26), check for other errors in the TFMR register and ignore the pre-recovery for the HDEC parity error. If TFMR has any other TB error bits set along with the HDEC parity error, we can safely ignore handling of the HDEC parity error. Also, when clearing the HDEC parity error bit from TFMR, allow only thread 0 to clear it.

  Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
  Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
* opal/hmi: Fix TB reside and HDEC parity error recovery for power9Mahesh Salgaonkar2017-10-231-3/+102
  On TB/HDEC errors, all 4 threads on the affected core receive the HMI. On power9, every thread on the core has its own copy of TB/HDEC, and hence every thread has to clear the dirty data from its own TB/HDEC register before we clear tb errors through TFMR[24]. The HMI recovery would fail if even one thread did not clean up its TB/HDEC register.

  Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
  Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
* core/hmi: Fix use of uninitialised value (CID 147808)Cyril Bur2017-08-151-3/+3
| | | | | | | | | | | | | A rework of where some of the xscom regs are for POWER9 has resulted in a scope issue where the same line attempts to simultaneously reference a variable by the same name in global and function scope. Change the value read by xscom_read to *_val Fixes: CID 147808 Fixes: bda5e0ea Fix scom addresses for power9 nx checkstop hmi handling. Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
* Fix scom addresses for power9 nx checkstop hmi handling.Mahesh Salgaonkar2017-07-071-7/+20
  Scom addresses for NX status, DMA & ENGINE FIR and PBI FIR have changed for Power9. Fix up those while handling nx checkstop for Power9.

  Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
  Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
* Fix scom addresses for power9 core checkstop hmi handling.Mahesh Salgaonkar2017-07-071-5/+56
  Scom addresses for CORE FIR (Fault Isolation Register) and Malfunction Alert Register have changed for Power9. Fix up those while handling core checkstop for Power9.

  Without this change the HMI handler fails to check for the correct reason for a core checkstop on Power9.

  Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
  Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
* capi: Handle HMI eventsChristophe Lombard2017-06-191-83/+43
| | | | | | | | | | | | Find the CAPP on the chip associated with the HMI event for PHB4. The recovery mode (re-initialization of the capp, resume of functional operations) is only available with P9 DD2. A new patch will be provided to support this feature. Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
* capi: Move phb3 capp registers to specialized filesChristophe Lombard2017-06-191-1/+1
| | | | | | | | | | | | | The definitions of the CAPP registers for PHB3 are moved in a specific file. The updated file capp.h will be used for the common functionalities about the CAPP for PHB3 and PHB4. Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
* Convert important polling loops to spin at lowest SMT priorityNicholas Piggin2017-06-061-1/+3
  The pattern of calling cpu_relax() inside a polling loop does not suit the powerpc SMT priority instructions. Preferred is to set a low priority, then spin until the break condition is reached, then restore priority.

  Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
  [stewart@linux.vnet.ibm.com: fixup lpc-uart wait_tx_room() and unit test]
  Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
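The preferred loop shape from this commit message can be sketched in portable C. The priority calls here are stubs that only record the current state so the example runs anywhere; on real hardware they would be the SMT priority hints. All `ex_` names, including the callback-based `ex_spin_until`, are hypothetical and not skiboot API:

```c
#include <stdbool.h>

/* Stub priority state so the sketch runs without powerpc hardware. */
enum ex_prio { EX_PRIO_MEDIUM, EX_PRIO_LOWEST };
static enum ex_prio ex_cur_prio = EX_PRIO_MEDIUM;

static void ex_smt_lowest(void) { ex_cur_prio = EX_PRIO_LOWEST; }
static void ex_smt_medium(void) { ex_cur_prio = EX_PRIO_MEDIUM; }

/*
 * Drop priority once, spin until done() reports true, then restore
 * priority once. Returns how many times the condition was false.
 */
static unsigned long ex_spin_until(bool (*done)(void *), void *arg)
{
	unsigned long spins = 0;

	ex_smt_lowest();
	while (!done(arg))
		spins++;
	ex_smt_medium();
	return spins;
}

/* Example condition: true once the counter has been polled to zero. */
static bool ex_countdown_done(void *arg)
{
	int *left = arg;

	if (*left == 0)
		return true;
	(*left)--;
	return false;
}
```

The point of the shape is that priority changes happen once on entry and once on exit, rather than on every iteration as with a relax call inside the loop body.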
* hmi: Print CAPP FIR information when handling CAPP malfunction alertsAndrew Donnellan2017-03-161-0/+29
| | | | | | | | | | | | | When diagnosing or debugging CAPP errors, it's rather useful to have the CAPP FIR, which often provides very helpful information. Print the CAPP FIR to the log when we handle a Malfunction Alert HMI for a CAPP error. Cc: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
* hmi: refactor malfunction alert handlingAndrew Donnellan2017-03-161-46/+42
| | | | | | | | | | | | | | | | | The logic in decode_malfunction() is rather tricky to follow. Refactor the code to make it easier to follow. No functional change. Cc: Russell Currey <ruscur@russell.cc> Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Cc: Ryan Grimm <grimm@linux.vnet.ibm.com> Cc: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Cc: Daniel Axtens <dja@axtens.net> Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
* fast-reboot: disable on FSP code update or unrecoverable HMIStewart Smith2016-10-251-0/+2
| | | | | | | | Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> [stewart@linux.vnet.ibm.com: unlock before return (suggested by Mahesh/Andrew), disable only on non-cancelling fsp codeupdate call (suggested by Vasant)] Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>