| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Do this before we fix TFAC errors. Otherwise the event at host console
shows no thread error reported in TFMR register.
Without this patch the console event show TFMR with no thread error:
(DEC parity error TFMR[59] injection)
[ 53.737572] Severe Hypervisor Maintenance interrupt [Recovered]
[ 53.737596] Error detail: Timer facility experienced an error
[ 53.737611] HMER: 0840000000000000
[ 53.737621] TFMR: 3212000870e04000
After this patch it shows old TFMR value on host console:
[ 2302.267271] Severe Hypervisor Maintenance interrupt [Recovered]
[ 2302.267305] Error detail: Timer facility experienced an error
[ 2302.267320] HMER: 0840000000000000
[ 2302.267330] TFMR: 3212000870e14010
Fixes: 674f7696f ("opal/hmi: Rework HMI handling of TFAC errors")
Cc: skiboot-stable@lists.ozlabs.org
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
| |
With opencapi, it's fairly common to trigger HMIs during AFU
development on the FPGA, by not replying in time to an NPU command,
for example. So shift the blame reported by that cow to avoid crowding
my mailbox.
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We were already logging some NPU registers during an HMI. This patch
cleans up a bit how it is done and separates what is global from what
is specific to nvlink or opencapi.
Since we can now receive an error interrupt when an opencapi link goes
down unexpectedly, we also dump the NPU state but we limit it to the
registers of the brick which hit the error.
The list of registers to dump was worked out with the hw team to
allow for proper debugging. For each register, we print the name as
found in the NPU workbook, the scom address and the register value.
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
| |
Set a flag to indicate OS about TOD/TB failure as part of new
opal_handle_hmi2 handler. This flag then can be used by OS to make sure
functions depending on TB value (e.g. udelay()) are aware of TB not
ticking.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
| |
unlock once and goto error_out.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When stop state 5 is enabled, reading the core_fir during an HMI can
result in a xscom read error with xscom_read() returning an
OPAL_XSCOM_PARTIAL_GOOD error code and core_fir value of all FFs. At
present this return error code is not handled in decode_core_fir()
hence the invalid core_fir value is sent to the kernel where it
interprets it as a FATAL hmi causing a system check-stop.
This can be prevented by forcing the core to wake-up using before
reading the core_fir. Hence this patch wraps the call to
read_core_fir() within calls to dctl_set_special_wakeup() and
dctl_clear_special_wakeup().
Suggested-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Signed-off-by: Mahesh J Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Reviewed-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When primary thread receives a CORE level HMI for timer facility errors
while secondaries are still in OPAL, thread 0 ends up in rendez-vous
waiting for secondaries to get into hmi handling. This is because OPAL
runs with MSR(EE=0) and hence HMIs are delayed on secondary threads until
they are given to Linux OS. Fix this by adding a check for secondary
state and force them in hmi handling by queuing job on secondary threads.
I have tested this by injecting HDEC parity error very early during Linux
kernel boot. Recovery works fine for non-TB errors. But if TB is bad at
this very eary stage we already doomed.
Without this patch we see:
[ 285.046347408,7] OPAL: Start CPU 0x0843 (PIR 0x0843) -> 0x000000000000a83c
[ 285.051160609,7] OPAL: Start CPU 0x0844 (PIR 0x0844) -> 0x000000000000a83c
[ 285.055359021,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 285.055361439,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:0: TFMR(2e12002870e14000) Timer Facility Error
[ 286.232183823,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc1)
[ 287.409002056,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc1)
[ 289.073820164,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc1)
[ 290.250638683,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc2)
[ 291.427456821,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc2)
[ 293.092274807,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc2)
[ 294.269092904,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc3)
[ 295.445910944,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc3)
[ 297.110728970,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc3)
After this patch:
[ 259.401719351,7] OPAL: Start CPU 0x0841 (PIR 0x0841) -> 0x000000000000a83c
[ 259.406259572,7] OPAL: Start CPU 0x0842 (PIR 0x0842) -> 0x000000000000a83c
[ 259.410615534,7] OPAL: Start CPU 0x0843 (PIR 0x0843) -> 0x000000000000a83c
[ 259.415444519,7] OPAL: Start CPU 0x0844 (PIR 0x0844) -> 0x000000000000a83c
[ 259.419641401,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 259.419644124,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:0: TFMR(2e12002870e04000) Timer Facility Error
[ 259.419650678,7] HMI: Sending hmi job to thread 1
[ 259.419652744,7] HMI: Sending hmi job to thread 2
[ 259.419653051,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 259.419654725,7] HMI: Sending hmi job to thread 3
[ 259.419654916,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 259.419658025,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 259.419658406,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:2: TFMR(2e12002870e04000) Timer Facility Error
[ 259.419663095,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:3: TFMR(2e12002870e04000) Timer Facility Error
[ 259.419655234,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:1: TFMR(2e12002870e04000) Timer Facility Error
[ 259.425109779,7] OPAL: Start CPU 0x0845 (PIR 0x0845) -> 0x000000000000a83c
[ 259.429870681,7] OPAL: Start CPU 0x0846 (PIR 0x0846) -> 0x000000000000a83c
[ 259.434549250,7] OPAL: Start CPU 0x0847 (PIR 0x0847) -> 0x000000000000a83c
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Core FIR[60] is a side effect of the work around for the CI Vector Load
issue in DD2.1. Usually this gets delivered as HMI with HMER[17] where
Linux already ignores it. But it looks like in some cases we may happen
to see CORE_FIR[60] while we are already in Malfunction Alert HMI
(HMER[0]) due to other reasons e.g. CAPI recovery or NPU xstop. If that
happens then just ignore it instead of crashing kernel as not recoverable.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
| |
HMIs for NPU2 are filtered with the 'compatible' string of the PHB, so
add opencapi to the mix.
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The NPU2{DBG,INF,ERR} macros use "NPU%d" as a prefix to identify messages
relating to a particular NPU.
It's slightly confusing to have per-NPU messages prefixed with "NPU0" or
"NPU1" and NPU-generic messages prefixed with "NPU2". On some future system
we could potentially have a NPU #2 in which case it'd be really confusing.
Use NPU rather than NPU2 for NPU-generic log messages. There's no risk of
confusion with the original npu.c code since that's only for P8.
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Acked-by: Reza Arbab <arbab@linux.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
HMIs for NPU xstops are broadcasted to all chips. All cores on all the
chips receive HMI. HMI handler correctly identifies and extracts the
NPU FIR details from affected chip, but while printing FIR data it
prints chip id and location code details of this_cpu()->chip_id which
may not be correct. This patch fixes this issue.
CC: stable # v6.0+
Fixes: 7bcbc78c ("Add location code to NPU2 HMI logging")
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
[stewart: add fixes and cc stable]
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The current HMI error message does not specifiy where the HMI
error occured.
The original error message was
NPU: FIR#0 FIR 0x0080100000000000 mask 0x009a48180f01ffff
The enhanced error message is
NPU2: [Loc: UOPWR.0000000-Node0-Proc0] P:0 FIR#0 FIR 0x0000100000000000 mask 0x009a48180f03ffff
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the recent patch:
eddff9bf40 hmi: Clear unknown debug trigger
I rebased the code from an older skiboot before the HMI rework. When I
did this, I missed the handled flag. Without this the HMER is not
cleared properly and the HMI keeps happening.
This properly sets the handled flag and hence clears the HMER bit.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On some systems, seeing hangs like this when Linux starts:
[ 170.027252763,5] OCC: All Chip Rdy after 0 ms
[ 170.062930145,5] INIT: Starting kernel at 0x20011000, fdt at 0x30ae0530 366247 bytes)
[ 171.238270428,5] OPAL: Switch to little-endian OS
If you look at the in memory skiboot console (or do 'nvram -p
ibm,skiboot --update-config log-level-driver=7') we see the console get
spammed with:
[ 5209.109790675,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000
[ 5209.109792716,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000
[ 5209.109794695,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000
[ 5209.109796689,7] HMI: Received HMI interrupt: HMER = 0x0000400000000000
We're taking the debug trigger (bit 17) early on, before the
hmi_debug_trigger function in the kernel is set up.
This clears the HMI in Skiboot and reports to the kernel instead of
bringing down the machine.
Signed-off-by: Ryan Grimm <grimm@linux.vnet.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
| |
Practically speaking, I don't think you'd *currently* hit this.
Found with Clang's scan-build.
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
Acked-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
| |
Processor recovery is per core error. All threads on that core receive
HMI. All threads don't need to generate HMI event for same error.
Let thread 0 only generate the event.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
| |
With this patch now we see reason string printed for CORE_WOF[43] bit.
[ 477.352234986,7] HMI: [Loc: U78D3.001.WZS004A-P1-C48]: P:8 C:22 T:3: Processor recovery occurred.
[ 477.352240742,7] HMI: Core WOF = 0x0000000000100000 recovered error:
[ 477.352242181,7] HMI: PC - Thread hang recovery
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
| |
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Due to P9 errata, HDEC parity and TB residue errors are latched for
non-zero threads 1-3 even if they are cleared. But these are not
latched on thread 0. Hence, use xscom SCOMC/SCOMD to read thread 0 tfmr
value and ignore them on non-zero threads if they are not present on
thread 0.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
| |
Helps in debugging...
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
While testing TFMR parity/corrupt error it has been observed that HMIs are
delivered twice for this error
- First time HMI is delivered with HMER[4,5]=1 and TFMR[60]=1.
- Second time HMI is delivered with HMER[4,5]=1 and TFMR[60]=0 with valid TB.
On second HMI we end up throwing below error message even though TB is in
valid state.
"HMI: TB invalid without core error reported"
This patch fixes this issue by ignoring HMER[5] and checking only for
TFMR[60] before setting this_cpu()->tb_invalid to true.
Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are some TOD errors which do not affect working of TOD and TB. They
stay in valid state. Hence we don't need rendez vous for TOD errors that
does not affect TB working.
TOD errors that affects TOD/TB will report a global error on TFMR[44]
alongwith bit 51, and they will go in rendez vous path as expected.
But the TOD errors that does not affect TB register sets only TFMR bit 51.
The TFMR bit 51 is cleared when any single thread clears the TOD error.
Once cleared, the bit 51 is reflected to all the cores on that chip. Any
thread that reads the TFMR register after the error is cleared will see
TFMR bit 51 reset. Hence the threads that see TFMR[51]=1, falls through
rendez-vous path and threads that see TFMR[51]=0, returns doing
nothing. This ends up in a soft lockups in host kernel.
This patch fixes this issue by not considering TOD interrupt (TFMR[51])
as a core-global error and hence avoiding rendez-vous path completely.
Instead threads that see TFMR[51]=1 will now take different path that
just do the TOD error recovery.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
For TOD errors, all the cores in the chip get HMIs. Any one thread from any
core can fix the issue and TFMR will have error conditions cleared. Rest of
the threads need take any action if TOD errors are already cleared. Hence
thread 0 of every core should get a fresh copy of TFMR before going ahead
recovery path. Initialize recover = -1, so that if no errors found that
thread need not send a HMI event to linux. This helps in stop flooding host
with hmi event by every thread even there are no errors found.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Do this before we check for TFAC errors. Otherwise the event at host console
shows no error reported in HMER register.
Without this patch the console event show HMER with all zeros
[ 216.753417] Severe Hypervisor Maintenance interrupt [Recovered]
[ 216.753498] Error detail: Timer facility experienced an error
[ 216.753509] HMER: 0000000000000000
[ 216.753518] TFMR: 3c12000870e04000
After this patch it shows old HMER values on host console:
[ 2237.652533] Severe Hypervisor Maintenance interrupt [Recovered]
[ 2237.652651] Error detail: Timer facility experienced an error
[ 2237.652766] HMER: 0840000000000000
[ 2237.652837] TFMR: 3c12000870e04000
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
| |
This patch reworks the HMI handling for TFAC errors by introducing
4 rendez-vous points improve the thread synchronization while handling
timebase errors that requires all thread to clear dirty data from TB/HDEC
register before clearing the errors.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
| |
The test for TFAC error is now redundant so we remove it and
remove the HMER argument.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
| |
Currently no functional change. This is a first step to completely
rewriting how these things are handled.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
| |
It returns a 64-bit flags mask currently set to provide info
about which timer facilities were lost, and whether an event
was generated.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Writing to HMER acts as an "AND". The current code writes back the
value we originally read with the bits we handled cleared. This is
racy, if a new bit gets set in HW after the original read, we'll end
up clearing it without handling it.
Instead, use an all 1's mask with only the bit handled cleared.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
| |
We want to make sure all reporting and actions are based
upon the same snapshot of HMER in case bits get added
by HW while we are in OPAL.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Due to the nature of debugging npu2 issues, folk are wanting the
full list of NPU2 registers dumped when there's a problem.
We have to list out each register as traversing the range
triggers FIR bits that confuse PRD.
Suggested-by: Ryan Black <rblack@us.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit fbdc91e693fc3103f7e2a65054ed32bfb26a2e17.
We don't need this as we need to do it a different way, with a explicit
set of registers as otherwise we trip other random FIR bits and everything
becomes even more terrible.
I suggest alcohol.
Cc: stable
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When an error is encountered that causes processor recovery, HMI is
generated if the recovery was successful. The reason is recorded in
the core FIR, which gets copied into the WOF.
In this case dump the WOF register and an error string into the OPAL
msglog.
A broken init setting led to HMIs reported in Linux as:
[ 3.591547] Harmless Hypervisor Maintenance interrupt [Recovered]
[ 3.591648] Error detail: Processor Recovery done
[ 3.591714] HMER: 2040000000000000
This patch would have been useful because it tells us exactly that
the problem is in the d-side ERAT:
[ 414.489690798,7] HMI: Received HMI interrupt: HMER = 0x2040000000000000
[ 414.489693339,7] HMI: [Loc: UOPWR.0000000-Node0-Proc0]: P:0 C:1 T:1: Processor recovery occurred.
[ 414.489699837,7] HMI: Core WOF = 0x0000000410000000 recovered error:
[ 414.489701543,7] HMI: LSU - SRAM (DCACHE parity, etc)
[ 414.489702341,7] HMI: LSU - ERAT multi hit
In future it will be good to unify this reporting, so Linux could
print something more useful. Until then, this gives some good data.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
| |
This is not the way we want to end up doing this.
This is a hack to make folk happy and not require crondump to
debug nvidia/npu2 issues.
Cc: stable
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
| |
Log HMI errors as step 1. OS will need to deduce
and interpret the HMI event.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
| |
No functionality change.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
| |
So that we don't flood OPAL console logs with information that is not
useful.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add few HMI debug prints with location code info few additional info.
No functionality change.
With this patch the log messages will look like:
[210612.175196744,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[210612.175200449,7] HMI: [Loc: UOPWR.1302LFA-Node0-Proc1]: P:8 C:16 T:1: TFMR(2d12000870e04020) Timer Facility Error
[210660.259689526,7] HMI: Received HMI interrupt: HMER = 0x2040000000000000
[210660.259695649,7] HMI: [Loc: UOPWR.1302LFA-Node0-Proc0]: P:0 C:16 T:1: Processor recovery Done.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
| |
No functionality changes.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add a workaround for a HW logic bug in Power9 where TB residue and HDEC
parity errors cleared by one thread aren't visible to other threads of same
core. The TB reside and HDEC parity error are reported through TFMR bit 45
and 26 respectively. If any of the thread from the core clears the TFMR bit
26 and 45, only thread 0 is able to see that errors are cleared but rest of
the threads 1, 2 and 3 do not see those as cleared. This causes TB error
recovery to fail for TB residue and HDEC parity errors. TFMR is per core
register and any changes made by a one thread should be visible by other
threads of the same core.
On TB residue error (TFMR bit 45), TB goes into invalid state. Hence avoid
handling/clearing TB residue error if TB is valid and running. Use TFMR bit 41
to check validity of TB state.
For HDEC parity error (TFMR bit 26), check for other errors on TFMR register
and ignore the pre-recovery for HDEC parity error. If TFMR has any other
TB error bits set alongwith HDEC parity error we can safely ignore handling
of HDEC parity error. Also, while clearing HDEC parity error bit from TFMR,
allow only thread 0 to clear it.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
| |
On TB/HDEC errors, all 4 threads on the affected receives HMI. On power9,
every thread on the core has its own copy of TB/HDEC and hence every thread
has to clear the dirty data from its own TB/HDEC register before we clear tb
errors through TFMR[24]. The HMI recovery would fail even if one thread
do not cleanup the respective TB/HDEC register.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A rework of where some of the xscom regs are for POWER9 has resulted in
a scope issue where the same line attempts to simultaneously reference a
variable by the same name in global and function scope.
Change the value read by xscom_read to *_val
Fixes: CID 147808
Fixes: bda5e0ea Fix scom addresses for power9 nx checkstop hmi handling.
Signed-off-by: Cyril Bur <cyril.bur@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
| |
Scom addresses for NX status, DMA & ENGINE FIR and PBI FIR has changed
for Power9. Fixup thoes while handling nx checkstop for Power9.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Scom addresses for CORE FIR (Fault Isolation Register) and Malfunction
Alert Register has changed for Power9. Fixup those while handling core
checkstop for Power9.
Without this change HMI handler fails to check for correct reason for
core checkstop on Power9.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Find the CAPP on the chip associated with the HMI event for PHB4.
The recovery mode (re-initialization of the capp, resume of functional
operations) is only available with P9 DD2. A new patch will be provided
to support this feature.
Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com>
Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The definitions of the CAPP registers for PHB3 are moved in a specific
file.
The updated file capp.h will be used for the common functionalities
about the CAPP for PHB3 and PHB4.
Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
| |
The pattern of calling cpu_relax() inside a polling loop does
not suit the powerpc SMT priority instructions. Prefrred is to
set a low priority then spin until break condition is reached,
then restore priority.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[stewart@linux.vnet.ibm.com: fixup lpc-uart wait_tx_room() and unit test]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When diagnosing or debugging CAPP errors, it's rather useful to have the
CAPP FIR, which often provides very helpful information.
Print the CAPP FIR to the log when we handle a Malfunction Alert HMI for a
CAPP error.
Cc: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The logic in decode_malfunction() is rather tricky to follow. Refactor the
code to make it easier to follow.
No functional change.
Cc: Russell Currey <ruscur@russell.cc>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: Ryan Grimm <grimm@linux.vnet.ibm.com>
Cc: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Cc: Daniel Axtens <dja@axtens.net>
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
| |
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
[stewart@linux.vnet.ibm.com: unlock before return (suggested by Mahesh/Andrew),
disable only on non-cancelling fsp codeupdate call (suggested by Vasant)]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|