| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
| |
Fixes: 2c8f96534a978bb4cac3e4b7dd393a9cc4926555
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
| |
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This was disabled at some point during bringup to make life easier for
the lab folks trying to debug NVLink issues. This hack really should
have never made it out into the wild though, so we now have the
following situation occuring in the field:
1) A bad happens
2) The host kernel recieves an unrecoverable HMI and calls into OPAL to
request a platform reboot.
3) OPAL rejects the reboot attempt and returns to the kernel with
OPAL_PARAMETER.
4) Kernel panics and attempts to kexec into a kdump kernel.
A side effect of the HMI seems to be CPUs becoming stuck which results
in the initialisation of the kdump kernel taking a extremely long time
(6+ hours). It's also been observed that after performing a dump the
kdump kernel then crashes itself because OPAL has ended up in a bad
state as a side effect of the HMI.
All up, it's not very good so re-enable the software checkstop by
default. If people still want to turn it off they can using the nvram
override.
Cc: skiboot-stable@lists.ozlabs.org
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Acked-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
| |
Enable a new PVR to get us running on another p9 variant.
Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
| |
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
| |
We print out a whole bunch of things on boot, most of which aren't
interesting, so we should *not* print them instead.
Printing things like what CPUs we found and what PCI devices we found
*are* useful, so continue to do that. But we don't need to splat out
a bunch of things that are always going to be true.
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit fbdc91e693fc3103f7e2a65054ed32bfb26a2e17.
We don't need this as we need to do it a different way, with a explicit
set of registers as otherwise we trip other random FIR bits and everything
becomes even more terrible.
I suggest alcohol.
Cc: stable
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
| |
This is not the way we want to end up doing this.
This is a hack to make folk happy and not require crondump to
debug nvidia/npu2 issues.
Cc: stable
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add a mechanism to enable/disable sw checkstop by looking at nvram option
opal-sw-xstop=<enable/disable>.
For now this patch disables the sw checkstop trigger unless explicitly
enabled through nvram option 'opal-sw-xstop=enable'i for p9. This will allow
an opportunity to get host kernel in panic path or xmon for unrecoverable
HMIs or MCE, to be able to debug the issue effectively.
To enable sw checkstop in opal issue following command:
# nvram -p ibm,skiboot --update-config opal-sw-xstop=enable
NOTE: This is a workaround patch to disable sw checkstop by default to gain
control in host kernel for better checkstop debugging. Once we have most of
the checkstop issues stabilized/resolved, revisit this patch to enable sw
checkstop by default.
For p8 platform it will remain enabled by default unless explicitly disabled.
To disable sw checkstop on p8 issue following command:
# nvram -p ibm,skiboot --update-config opal-sw-xstop=disable
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Due to a hardware issue where core responding to scom was delayed due to
thread reconfiguration, leaves the SCOM logic in a state where the
subsequent scom to that core can get errors. This is affected for Core
PC scom registers in the range of 20010A80-20010ABF
The solution is if a xscom timeout occurs to one of Core PC scom registers
in the range of 20010A80-20010ABF, a clearing scom write is done to
0x20010800 with data of '0x00000000' which will also get a timeout but
clears the scom logic errors. After the clearing write is done the original
scom operation can be retried.
The scom timeout is reported as status 0x4 (Invalid address) in HMER[21-23].
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
| |
So caller of xscom_reset() does not have to bother about adding a delay
separately. Instead caller can control whether to add a delay or not using
second argument to xscom_reset().
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Workaround on P9: PRD does operations it *knows* will fail with this
error to work around a hardware issue where accesses via the PIB
(FSI or OCC) work as expected, accesses via the ADU (what xscom goes
through) do not. The chip logic will always return all FFs if there
is any error on the scom.
Suggested-by: Daniel M Crowell <dcrowell@us.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Acked-by: Jeremy Kerr <jk@ozlabs.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
xscom_read/write operations returns CHIPLET_OFFLINE when chiplet is offline.
Some multicast xscom_read/write requests from HBRT results in xscom operation
on offline chiplet(s) and printing below warnings in OPAL console.
[ 135.036327572,3] XSCOM: Read failed, ret = -14
[ 135.092689829,3] XSCOM: Read failed, ret = -14
This results in unnecessary bugs. Hence remove error message for multicast
SCOM operations.
Suggested-by: Daniel M Crowell <dcrowell@us.ibm.com>
Tested-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
| |
Current code only supports DD1. This adds support for DD2.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
| |
It is common for xscom registers to only contain specific bit fields that
need to be modified without altering the rest of the register. This adds a
convenience function to perform xscom read-modify-write operations under a
mask.
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
90% of what we print isn't useful to a normal user. This
dramatically reduces the amount of messages printed by
OPAL in normal circumstances.
We still need to add a way to bump the log level at boot
based on a BMC scratch register or some HDAT property.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the mambo tcl we set the CPU version to DD 2.0, because mambo is not
bug compatible with DD 1.
But in xscom_read_cfam_chipid() we have a hard coded value, to work
around the lack of the f000f register, which claims to be P9 DD 1.0.
This doesn't seem to cause crashes or anything, but at boot we do see:
[ 0.003893084,5] XSCOM: chip 0x0 at 0x1a0000000000 [P9N DD1.0]
So fix it to claim that the xscom is also DD 2.0 to match the CPU.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add code to perform indirect form 1 scoms.
POWER8 does form 0 only. POWER9 adds form 1. The form is determined
from the address only. Hardware only allows writes for form 1.
Only hostboot uses these scoms during IPL, so they are unused by
skiboot currently.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
| |
Indirect scoms can only set certain bits of data. Ensure only these
are set when trying to write.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
| |
Add scom reset registers for POWER9.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
| |
Abstract error recovery registers to get ready for POWER9.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
| |
SCOM doesn't exist, we end up in a sad place (machine check)
Fixes: c9cadb4fe60d4d41fc45a35a1d2ae27e0632c679
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
| |
To differentiate between 1.00, 1.01, 1.02 etc...
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
| |
We add some routines that let a caller get the xscom lock once and
then do a bunch of xscoms while holding it.
In some situations without this, it could take long enough to get
the xscom lock that the 1ms timeout would expire and we'd falsely
think the SLW timer didn't work when in fact it did.
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
| |
Instead of mapping them to just 3 different codes, define an OPAL
error code for all known HMER error status, as different recovery
path might be needed at the call site, and it allows for more
informative logging.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
| |
In case of error, don't leave the data random. It helps debugging when
the user fails to check the error code. This happens due to a bug in the
PRD wrapper app.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
| |
The checks validate pointers sent in using
opal_addr_valid() in opal_call API's provided
via the console, cpu, fdt, flash, i2c, interrupts,
nvram, opal-msg, opal, opal-pci, xscom and
cec modules
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
| |
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Michael Neuling <mikey@neuling.org>
[stewart@linux.vnet.ibm.com: added FWTS annotation]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
| |
XSCOM engine blocks subsequently after querying FIR of any
sleeping core. This causes subsequent XSCOM opertions to hang
forever due to XSCOM engine being continuously busy. Reset XSCOM
engine after querying FIR of any sleeping core.
Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
| |
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Michael Neuling <mikey@neuling.org>
[stewart@linux.vnet.ibm.com: split from timebase quirk patch]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
| |
More to go, especially we need to review recovery, but at least
this enables indirect and errors out on not-yet-supported EX
targeting.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
| |
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
| |
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
OPAL retries XSCOM read/write operations forever till it succeeds.
This can cause XSCOM ops to hang forever when XSCOM engine remains
busy for some reason. Changed it to retry XSCOM operations only
XSCOM_BUSY_MAX_RETRIES number of times instead of retrying forever.
Also added logic to reset XSCOM engine after XSCOM_BUSY_RESET_THRESHOLD
number of retries to unblock it when it remains busy.
Cc: stable # 9c2d82394fd2 ("xscom: Return OPAL_WRONG_STATE on XSCOM ops..")
Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com>
Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Add PVR detection, chip id and other misc bits for POWER9.
POWER9 changes the location of the HILE and attn enable bits in the
HID0 register, so add these definitions also.
Signed-off-by: Michael Neuling <mikey@neuling.org>
[stewart@linux.vnet.ibm.com: Fix Numbus typo, hdata_to_dt build fixes]
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
xscom_read and xscom_write return OPAL_SUCCESS if they worked, and
OPAL_HARDWARE if they didn't. This doesn't provide information about why
the operation failed, such as if the CPU happens to be asleep.
This is specifically useful in error scanning, so if every CPU is being
scanned for errors, sleeping CPUs likely aren't the cause of failures.
So, return OPAL_WRONG_STATE in xscom_read and xscom_write if the CPU is
sleeping.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
| |
hw/xscom.c:425:23: warning: constant 0x221EF04980000000 is so big it is long
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
| |
It works just like P8, we copy the code for now rather than make
it somewhat common due to our locking differences and to limit
the risk close to release. We can refactor later.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
| |
We didn't pass the right "is_write" argument for writes and
the string used for logging was somewhat confusing.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch adds a new OPAL call OPAL_CEC_REBOOT2 which will
be used to handle abnormal reboot/termination by kernel host.
This call will allow host kernel to pass reboot type and additional
debug data which needs to be captured/saved somewhere (for later
analysis) before going down.
Currently it will support two reboot types (0). normal reboot, that
will behave similar to that of opal_cec_reboot() call, and
(1). platform error reboot, that will trigger a system checkstop
using xscom address and FIR bit information obtained via device-tree
property 'ibm,sw-checkstop-fir'.
For unsupported reboot type, this call will do nothing and return
with OPAL_UNSUPPORTED.
In future, we can overload this call to support additional reboot types.
Signed-off-by: Vipin K Parashar <vipin@linux.vnet.ibm.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Reviewed-by: Samuel Mendoza-Jonas <sam.mj@au1.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
| |
There are now no users of the call_out parameter and future users should
use the log_append_msg() and log_append_data() functions, so remove all
references to call_out.
Signed-off-by: Samuel Mendoza-Jonas <sam.mj@au1.ibm.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
| |
If we get an error from the xscom_read() call we could extract a
garbage chip_id rather than just returning an error.
Caught by LLVM scan-build
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|\ |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
his primarily checks whether the caller already holds the corresponding
locks to avoid re-entrancy in some of the deep error path such as when
XSCOM itself triggers an error log. It will be extended in the case of
LPC to also handle known HW error states.
We use them to avoid queuing/polling in the BT driver and to discard
characters in the UART driver.
Note: This will not normally involve a loss of log to the UART as the
UART driver is also protected by the console suspend mechanism. So
this is a safety mechanism only.
This fixes issues where the generation of error logs inside the LPC or
XSCOM drivers could cause a re-entrancy (via the BT interface)
causing deadlocks. Now, the error logs IPMI messages will be queued up
and delivered later on the next poll handler.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
| |
| |
| |
| |
| |
| |
| | |
Nothing should be using it nowadays and it's dangerous.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
| |
| |
| |
| |
| | |
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|/
|
|
|
|
|
|
|
|
| |
This adds the PVR and CFAM ID for the Naples chip. Otherwise treated as
a Venice.
This doesn't add the definitions for the new PHB revision yet
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
| |
Create xscom_read_cfam_chipid() to read the cfam chipid().
We'll need to use this in a few places, so avoid replicating the code.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The last two patches updated GETFIELD() and SETFIELD() to no longer
require the user to specify the mask and shift of a field, and to
remove all _LSH defines and rename any _MASK defines. There are
some places where the masks were used directly, where the caller
needs to have the _MASK suffix removed. There are also two users
of SETFIELD() where the field name still has the _MASK suffix
because there is an existing macro with the base name.
Change users of SETFIELD() to include the _MASK suffix where needed.
Change direct users of any mask to remove the _MASK suffix.
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The last patch changed the SETFIELD() and GETFIELD() macros to automatically
calculate the shift of a given mask, so manually specifying the shift is no
longer needed. Additionally, any masks should have the _MASK suffix removed
since the GETFIELD() and SETFIELD() operations expected to be passed the
mask name without the _MASK suffix (and so either the mask name or the
get/setfield call needs to have its mask name changed).
Change all _MASK masks to remove the _MASK suffix, except for any places
that leaving _MASK makes sense (e.g. already an existing define without
_MASK suffix).
Remove all _LSH defines, as they are no longer needed.
Signed-off-by: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>
|