| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
| |
ipmi-hiomap implements the PNOR access control protocol formerly known
as "the mbox protocol" but uses IPMI instead of the AST LPC mailbox as a
transport. As there is no longer any mailbox involved in this alternate
implementation the old protocol name is quite misleading, and so it has
been renamed to "the hiomap protocol" (Host I/O Mapping protocol). The
same commands and events are used though this client-side implementation
assumes v2 of the protocol is supported by the BMC.
The code is a heavily-reworked copy of the mbox-flash source and is
introduced this way to allow for the mbox implementation's eventual
removal.
mbox-flash should in theory be renamed to mbox-hiomap for consistency,
but as it is on life-support effective immediately we may as well just
remove it entirely when the time is right.
Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
[stewart: prlog debug over prerror for mbox fallback, fix indent]
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
This ensures progress when we don't have interrupts available for IPMI.
Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
Previously flash_register() held flash_lock across ffs_init(), which
calls through the blocklevel layer to read the flash. This is unhelpful
with the IPMI HIOMAP protocol transport as LPC interrupts have not yet
been enabled and we are relying on polling to progress. The held lock
stalls the boot as we take the nopoll path in time_wait() while
completing ipmi_queue_msg_sync() in libflash/ipmi-flash.c.
Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
Otherwise we can get reports of core/lock.c owning the lock, which is
not helpful when tracking down ownership issues.
Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
bust_locks is a big hammer that guarantees a mess if it's set while
all other threads are not stopped.
I propose removing this in the lock error paths. In debugging the
previous deadlock false positive, none of the error messages printed,
and the in-memory console was totally garbled due to lack of locking.
I think it's generally better for debugging and system integrity to
keep locks held when lock errors occur. Lock busting should be used
carefully, just to allow messages to be printed out or machine to be
restarted, probably when the whole system is single-threaded.
Skiboot is slowly working toward that being feasible with co-operative
debug APIs between firmware and host, but for the time being,
difficult lock crashes are better not to corrupt everything by
busting locks.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
If a lock waiter exceeds the warning timeout, it prints a message
while still registered as requesting the lock. Printing the message
can take locks, so if one is held when the owner of the original
lock tries to print a message, it will get a false positive deadlock
detection, which brings down the system.
This can easily be hit when there is a lot of HMI activity from a
KVM guest, where the timebase was not returned to host timebase
before calling the HMI handler.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
On DD2.0 parts, PCIe ECC protection is not warranted in the response
data path. Thus, for these parts, we need to flag any ECC errors
detected from the adjacent AIB RX Data path so the part can be
replaced.
This patch configures the FIRs so that we escalate these AIB ECC
errors to a checkstop so the parts can be replaced.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
Currently if we try to run a raw/stripped binary kernel (i.e. without
the ELF header) we crash with:
[ 0.008757768,5] INIT: Waiting for kernel...
[ 0.008762937,5] INIT: platform wait for kernel load failed
[ 0.008768171,5] INIT: Assuming kernel at 0x20000000
[ 0.008779241,3] INIT: ELF header not found. Assuming raw binary.
[ 0.017047348,5] INIT: Starting kernel at 0x0, fdt at 0x3044b230 14339 bytes
[ 0.017054251,0] FATAL: Kernel is zeros, can't execute!
[ 0.017059054,0] Assert fail: core/init.c:590:0
[ 0.017065371,0] Aborting!
This is because we haven't set kernel_entry correctly in this path.
This fixes it.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
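The raw-binary path described in the commit above can be sketched roughly as follows. This is not the skiboot code: the function names are hypothetical, and it assumes that a raw image is entered at its load address, which the commit message implies but does not state.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* True if the image starts with the 4-byte ELF identification magic. */
static bool sketch_has_elf_magic(const void *image)
{
	static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

	return memcmp(image, magic, sizeof(magic)) == 0;
}

/*
 * Pick the address to jump to. elf_entry is whatever a (not shown) ELF
 * parser extracted; it is ignored for raw images.
 * Assumption: a raw binary is entered at its load address.
 */
static uint64_t sketch_kernel_entry(const void *image, uint64_t load_base,
				    uint64_t elf_entry)
{
	if (sketch_has_elf_magic(image))
		return elf_entry;
	return load_base;	/* raw binary: never leave the entry at 0 */
}
```

The point of the sketch is only that both branches produce a non-zero entry point, so the "Kernel is zeros" assertion is not reached for a valid raw image.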
| |
When primary thread receives a CORE level HMI for timer facility errors
while secondaries are still in OPAL, thread 0 ends up in rendez-vous
waiting for secondaries to get into hmi handling. This is because OPAL
runs with MSR(EE=0) and hence HMIs are delayed on secondary threads until
they are given to Linux OS. Fix this by adding a check for secondary
state and forcing them into HMI handling by queuing a job on secondary threads.
I have tested this by injecting an HDEC parity error very early during Linux
kernel boot. Recovery works fine for non-TB errors. But if TB is bad at
this very early stage we are already doomed.
Without this patch we see:
[ 285.046347408,7] OPAL: Start CPU 0x0843 (PIR 0x0843) -> 0x000000000000a83c
[ 285.051160609,7] OPAL: Start CPU 0x0844 (PIR 0x0844) -> 0x000000000000a83c
[ 285.055359021,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 285.055361439,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:0: TFMR(2e12002870e14000) Timer Facility Error
[ 286.232183823,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc1)
[ 287.409002056,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc1)
[ 289.073820164,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc1)
[ 290.250638683,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc2)
[ 291.427456821,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc2)
[ 293.092274807,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc2)
[ 294.269092904,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc3)
[ 295.445910944,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc3)
[ 297.110728970,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc3)
After this patch:
[ 259.401719351,7] OPAL: Start CPU 0x0841 (PIR 0x0841) -> 0x000000000000a83c
[ 259.406259572,7] OPAL: Start CPU 0x0842 (PIR 0x0842) -> 0x000000000000a83c
[ 259.410615534,7] OPAL: Start CPU 0x0843 (PIR 0x0843) -> 0x000000000000a83c
[ 259.415444519,7] OPAL: Start CPU 0x0844 (PIR 0x0844) -> 0x000000000000a83c
[ 259.419641401,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 259.419644124,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:0: TFMR(2e12002870e04000) Timer Facility Error
[ 259.419650678,7] HMI: Sending hmi job to thread 1
[ 259.419652744,7] HMI: Sending hmi job to thread 2
[ 259.419653051,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 259.419654725,7] HMI: Sending hmi job to thread 3
[ 259.419654916,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 259.419658025,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000
[ 259.419658406,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:2: TFMR(2e12002870e04000) Timer Facility Error
[ 259.419663095,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:3: TFMR(2e12002870e04000) Timer Facility Error
[ 259.419655234,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:1: TFMR(2e12002870e04000) Timer Facility Error
[ 259.425109779,7] OPAL: Start CPU 0x0845 (PIR 0x0845) -> 0x000000000000a83c
[ 259.429870681,7] OPAL: Start CPU 0x0846 (PIR 0x0846) -> 0x000000000000a83c
[ 259.434549250,7] OPAL: Start CPU 0x0847 (PIR 0x0847) -> 0x000000000000a83c
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
Rename and set it as a pre_pci_fixup platform function. The indirect
call doesn't make a whole lot of sense IMO.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
Vague documentation is about as annoying as no documentation.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
Commit eb146fac9685 ("core/i2c: Move the timeout field into
i2c_request") simplified a bit how a request timeout is
handled. However there's now some confusion between milliseconds and
timebase increments when defining or using the timeout values, which
breaks i2c requests made for opencapi, and probably others too.
This patch declares all the timeouts in milliseconds and just converts
to timebase at the end of the chain, as needed.
Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com>
Tested-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
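The "milliseconds everywhere, convert at the end" rule from the commit above might look like the sketch below. The 512 MHz timebase frequency, the structure and the helper names are assumptions made for illustration; they are not taken from the patch.

```c
#include <stdint.h>

/* Assumption: a 512 MHz timebase, i.e. 512000 ticks per millisecond. */
#define SKETCH_TB_HZ	512000000ULL

/* Convert milliseconds to timebase ticks. */
static inline uint64_t sketch_msecs_to_tb(uint64_t msecs)
{
	return msecs * (SKETCH_TB_HZ / 1000);
}

/* Hypothetical request: the timeout is always stored in milliseconds. */
struct sketch_i2c_request {
	uint64_t timeout_ms;
};

/*
 * The conversion to timebase units happens in exactly one place, when
 * the absolute deadline is computed from the current timebase value.
 */
static uint64_t sketch_request_deadline(const struct sketch_i2c_request *req,
					uint64_t now_tb)
{
	return now_tb + sketch_msecs_to_tb(req->timeout_ms);
}
```

Keeping a single unit in every structure field and function argument makes a mixed-unit bug show up as a missing conversion at one call site, instead of a silently wrong timeout.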
| |
There are patches that will go into dtc to fix the issues we hit, but
for the moment let's just build and use a slightly older version.
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
The message below is confusing. Let's make it clear.
FSP sends "R/R complete notification" whenever there is a dump. We use `flag`
to identify whether it's an R/R completion -OR- just a new dump notification.
[ 483.406351956,6] FSP: SP says Reset/Reload complete
[ 483.406354278,5] DUMP: FipS dump available. ID = 0x1a00001f [size: 6367640 bytes]
[ 483.406355968,7] A Reset/Reload was NOT done
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
PMEM_VOLATILE and PMEM_DISK can't be used together and are basically
copies of the same code.
This merges the two and allows them to be used together. The same API is kept.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
There are a couple of places in 'phb4.c' where we may want to dump the PEC's
error registers. Hence we introduce a phb4_dump_pec_err_regs() that dumps
all the PEC error registers and also updates phb4->nfir_cache &
phb4->pfir_cache for later use.
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
During fast-reboot new PEC errors can be latched even after ETU-Reset
is asserted. This will cause the values of the variables nfir_cache and
pfir_cache to be out of sync.
During step-2 of CRESET nfir_cache and pfir_cache values are used to
bring the PHB out of reset state. However, if these variables are out
of date as noted above, the nfir/pfir registers are never reset
completely and the ETU still remains frozen.
Hence this patch updates step-2 of phb4_creset to re-read the values of
nfir/pfir registers to check if any new errors were reported after
ETU-reset was asserted, report these new errors and reset the
nfir/pfir registers. This should bring the ETU out of reset
successfully.
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Tested-By: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
Allow specifying a file on the command line to read OCC SRAM data into.
If no file is specified then we print it to stdout as text. This is a
bit inconsistent, but it retains compatibility with the existing tool.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
The XSCOM base address of the OCC control registers changed slightly
between P8 and P9. Fix this up and add a bit of PVR checking so we look
in the right place.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
This takes a checksum of skiboot memory after boot that should be
unchanged during OS operation, and verifies it before allowing a
fast reboot.
This is not read-only memory from skiboot's point of view, because
it includes things like the opal branch table that gets populated
during boot.
This helps to improve the integrity of firmware against host and
runtime firmware memory scribble bugs.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
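The record-at-boot, verify-before-fast-reboot idea in the commit above can be sketched as below. The checksum function, the structure and every name here are illustrative assumptions; the commit message does not say which checksum skiboot uses or how the region is delimited.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simple rotate-and-xor checksum; a stand-in, not skiboot's algorithm. */
static uint64_t sketch_checksum(const void *start, size_t len)
{
	const uint8_t *p = start;
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < len; i++)
		sum = ((sum << 1) | (sum >> 63)) ^ p[i];
	return sum;
}

/* Region that should not change once boot has finished populating it. */
struct sketch_integrity {
	const void *start;
	size_t len;
	uint64_t boot_csum;
};

/* Call once at the end of boot, after tables in the region are filled in. */
static void sketch_integrity_record(struct sketch_integrity *st,
				    const void *start, size_t len)
{
	st->start = start;
	st->len = len;
	st->boot_csum = sketch_checksum(start, len);
}

/* Call before a fast reboot; false means the region was scribbled on. */
static bool sketch_integrity_verify(const struct sketch_integrity *st)
{
	return sketch_checksum(st->start, st->len) == st->boot_csum;
}
```

Recording the checksum only after boot-time population is what lets the region include data such as a branch table that is written during boot but should stay fixed afterwards.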
| |
This also tidies up linker script symbol declarations and adds
_rodata_mem symbol for the next change to use.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
Mambo image payloads get overwritten by the OS and by
fast reboot memory clearing because they have no region
defined. Add them, which allows fast reboot to work.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[stewart: fix up 'make check']
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
Once things start to go wrong, disable_fast_reboot can be called a
number of times, so make the first reason sticky, and also print it
to the console at disable time. This helps with making sense of
fast reboot disables.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
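A "first reason wins" disable function, as described in the commit above, could be sketched like this. The variable name, the function name and the message text are made up for the example, and whether later calls also print is an assumption (here they do not).

```c
#include <stddef.h>
#include <stdio.h>

/* NULL while fast reboot is still allowed; otherwise the first reason. */
static const char *sketch_fast_reboot_disabled;

static void sketch_disable_fast_reboot(const char *reason)
{
	/* Sticky: later callers must not overwrite the original cause. */
	if (sketch_fast_reboot_disabled)
		return;

	/* Report at disable time, while the context is still fresh. */
	fprintf(stderr, "fast reboot disabled: %s\n", reason);
	sketch_fast_reboot_disabled = reason;
}
```

When failures cascade, the first reason is usually the root cause, which is why the sketch keeps it and drops the rest.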
| |
I missed a hunk when merging :(
Reported-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Fixes: 7c8e1c6f89f3aac77661cfcee75ab515bd053d75
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
This means that if it's permanently disabled on boot, the test suite can
pick that up and not try a fast reboot test.
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
We print a message when nvram_query() needs to wait for the NVRAM to
be loaded from the BMC/FSP. Currently this is printed at PR_WARNING
which is excessive since this doesn't actually indicate that anything is
wrong. There's also nothing that we can really do about loading the
NVRAM being slow, so just print this at PR_DEBUG.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
Print how long we had to wait for NVRAM to become available if we needed
to wait.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
A fault will occur if 'value == NULL' is passed to nvram_query_eq() to
check if a given key doesn't exist in the nvram partition. This is an
invalid use of the API as it's only supposed to be used for keys that
exist in nvram and 'value == NULL' is never possible.
Hence this patch adds an assert to the function to flag such a use and
also prevent NULL being passed as an argument to strcmp().
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Suggested-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
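The guard described in the commit above amounts to asserting on the argument before it can reach strcmp(). A minimal sketch follows; the lookup function is a hypothetical stand-in with one hard-coded key, not the real nvram code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the real lookup: one known key. */
static const char *sketch_nvram_query(const char *key)
{
	if (strcmp(key, "fast-reset") == 0)
		return "1";
	return NULL;	/* key not present */
}

/* True only if the key exists and its value equals 'value'. */
static bool sketch_nvram_query_eq(const char *key, const char *value)
{
	const char *s;

	/* Passing NULL is API misuse; catch it before strcmp() sees it. */
	assert(value != NULL);

	s = sketch_nvram_query(key);
	if (!s)
		return false;
	return strcmp(s, value) == 0;
}
```

A caller that wants to know whether a key is absent would check the lookup result for NULL directly, instead of passing NULL as the value to compare against.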
| |
I don't see a reason why there would need to be a PHB3 *specific*
subsystem in the error logs, so rename it to PHB so that PHB4 and
later can use it too without continually redefining it.
This shouldn't change any existing assumptions because it's unused.
Signed-off-by: Russell Currey <ruscur@russell.cc>
Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
Without this, on some P8 platforms, we could (falsely) think the SBE timer
had stalled, getting the dreaded "timer stuck" message.
The code was doing the mftb() to set the start of the timeout period while
*not* holding the lock, so the 1ms timeout started sometime when somebody
else had the xscom lock.
The simple solution is to just do the whole routine holding the xscom lock,
so do it that way.
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
In this commit
commit 737c0ba3d72b8aab05a765a9fc111a48faac0f75
Author: Michael Neuling <mikey@neuling.org>
Date: Thu Feb 22 10:52:18 2018 +1100
phb4: Disable lane eq when retrying some nvidia GEN3 devices
We made a typo and set PH2 twice. This fixes it.
It worked previously because, if only phase 2 (PH2) is set, it skips both
phase 2 and phase 3 (PH3).
Reported-by: Meng Li <shlimeng@cn.ibm.com>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
These are now pointless and they can be replaced with zalloc() and free().
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
The p8_i2c_request structure is barely used and the only useful data it
contains (port_num) can be derived from the bus pointer. Remove it in
preparation for removing the per-bus allocation and free methods.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
Currently to set a per-request timeout you need to use
i2c_req_set_timeout() which is a wrapper for a per-bus method that sets the
actual timeout. This design doesn't make a whole lot of sense, so move
the timeout field into the generic i2c_request structure and have the
timeout set using that.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
Core FIR[60] is a side effect of the work around for the CI Vector Load
issue in DD2.1. Usually this gets delivered as HMI with HMER[17] where
Linux already ignores it. But it looks like in some cases we may happen
to see CORE_FIR[60] while we are already in Malfunction Alert HMI
(HMER[0]) due to other reasons e.g. CAPI recovery or NPU xstop. If that
happens then just ignore it instead of crashing the kernel as not recoverable.
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
Presently a fault will happen in opal_sync_host_reboot if a callback
tries to remove itself from the opal_syncers list by calling
opal_del_host_sync_notifier.
This happens as iteration over opal_syncers is done using the
list_for_each() which doesn't preserve list_node->next. So when
the current opal_syncers callback removes itself from the list, current
node contents are lost and current_node->next pointer is rendered
invalid.
To fix this we simply switch from list_for_each() to
list_for_each_safe() which keeps the current_node->next cached hence
even if the current node is freed, iteration over subsequent nodes can
still continue.
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
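The difference between the two iteration macros in the commit above comes down to when the next pointer is read. The sketch below shows the safe pattern on a hand-rolled singly linked list; the list type and all function names are hypothetical and only stand in for skiboot's list API.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical notifier node; remove_self marks a self-unregistering one. */
struct sketch_notifier {
	struct sketch_notifier *next;
	int id;
	int remove_self;
};

/* Push a new node at the head; returns the new head. */
static struct sketch_notifier *sketch_add(struct sketch_notifier *head,
					  int id, int remove_self)
{
	struct sketch_notifier *n = malloc(sizeof(*n));

	if (!n)
		return head;
	n->id = id;
	n->remove_self = remove_self;
	n->next = head;
	return n;
}

/* Unlink and free one node; returns the (possibly new) head. */
static struct sketch_notifier *sketch_del(struct sketch_notifier *head,
					  struct sketch_notifier *victim)
{
	struct sketch_notifier **pp = &head;

	while (*pp && *pp != victim)
		pp = &(*pp)->next;
	if (*pp) {
		*pp = victim->next;
		free(victim);
	}
	return head;
}

/*
 * Safe walk: 'next' is read before the node is allowed to remove itself,
 * so the loop never dereferences freed memory.
 */
static struct sketch_notifier *sketch_run_all(struct sketch_notifier *head,
					      int *calls)
{
	struct sketch_notifier *cur, *next;

	for (cur = head; cur; cur = next) {
		next = cur->next;	/* cache before cur may be freed */
		(*calls)++;
		if (cur->remove_self)
			head = sketch_del(head, cur);
	}
	return head;
}
```

An unsafe variant would advance with cur = cur->next after the removal, which reads from a node that has already been freed. Note that the safe pattern only protects against removal of the current node; removing the cached next node from inside a callback would still be a bug.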
| |
OpenCAPI on Witherspoon is slightly more involved than on Zaius and ZZ, due
to the OpenCAPI links using the SXM2 connectors that are used for NVLink
GPUs.
This patch adds the regular OpenCAPI platform information, and also a
Witherspoon-specific presence detection callback that uses the previously
added OCC GPU presence detection to figure out the device types plugged
into each SXM2 socket.
The SXM2 connectors are capable of carrying 2 OpenCAPI links, and future
OpenCAPI devices are expected to make use of this. However, we don't yet
support ganged links and the various implications that has for handling
things like device reset, so for now, we only enable 1 brick per device.
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Acked-by: Reza Arbab <arbab@linux.ibm.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
In platform_ocapi, we define i2c_{reset,presence}_odl{0,1} to specify the
appropriate reset/presence GPIO pins for devices connected to ODL0 and ODL1
respectively.
This is obviously wrong, because a device connected to brick 2 and a device
connected to brick 4 are going to be different devices connected to
different I2C pins, but rather conveniently we haven't had to deal with
systems that can use the full 4 bricks as yet. Now that we're adding
OpenCAPI support for Witherspoon, we should change this to specify pins
separately for all 4 bricks.
Replace i2c_{reset,presence}_odl{0,1} with
i2c_{reset,presence}_brick{2,3,4,5} and update the presence detection code,
device reset code, and existing platforms accordingly.
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
There is no standardised way to determine the presence and type of devices
connected to an NPU on POWER9.
Currently, we hardcode device types based on platform type (as no platform
currently supports both OpenCAPI and NVLink), and for OpenCAPI platforms
we use I2C to detect presence.
Witherspoon (and potentially other platforms later on) supports both
NVLink and OpenCAPI, and additionally uses SXM2 connectors which can carry
more than one link, rather than the SlimSAS connectors used for OpenCAPI on
Zaius and ZZ. This necessitates some special handling.
Add a platform callback for NPU device detection. In a later patch, we
will use this to implement Witherspoon-specific device detection. For now,
add a Witherspoon stub that sets all links to NVLink (i.e. current
behaviour).
Move the existing I2C-based presence detection for OpenCAPI devices on
Zaius/ZZ into common code, which we use by default for platforms which do
not define a callback. Clean up the use of the ibm,npu-link-type property,
which will now be exposed solely for debugging and not consumed internally.
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
Replace probe_npu2() and probe_npu2_opencapi() with a new shared
probe_npu2(). Refactor some of the common NPU setup code into shared code.
No functional change. This patch does not implement support for using both
types of devices simultaneously on the same NPU - we expect to add this
sometime in the future.
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Acked-by: Reza Arbab <arbab@linux.ibm.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
On Witherspoon, OpenCAPI devices attached to link indexes 0 and 1 are
handled by bricks 2 and 3.
Rename index to brick_index, and add a new field, link_index, to
refer to the link index. For now, we set those values identically.
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Acked-by: Reza Arbab <arbab@linux.ibm.com>
Reviewed-by: Alistair Popple <alistair@popple.id.au>
Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
It takes a few seconds for the OCC to set everything up in order to read
GPU presence. At present, we try to kick off OCC initialisation as early as
possible to maximise the time it has to read GPU presence.
Unfortunately sometimes that's not enough, so add a loop in
occ_get_gpu_presence() so that on the first time we try to get GPU presence
we keep trying for up to 2 seconds. Experimentally this seems to be
adequate.
Fixes: 9b394a32c8ea ("occ: Add support for GPU presence detection")
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
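The bounded retry described in the commit above can be sketched as a poll loop with a time budget. Everything here is illustrative: the OCC is simulated by a counter, the 100 ms step is an arbitrary choice, and the real code applies the wait only on the first query, which this sketch does not model.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated OCC: reports GPU presence only after a number of polls. */
struct sketch_occ {
	unsigned int ready_after_polls;
	unsigned int polls;
	uint32_t presence_bits;
};

/* One read attempt; false means the data is not valid yet. */
static bool sketch_occ_read(struct sketch_occ *occ, uint32_t *presence)
{
	occ->polls++;
	if (occ->polls < occ->ready_after_polls)
		return false;
	*presence = occ->presence_bits;
	return true;
}

/* Delay hook used by the example; real code would actually wait. */
static void sketch_no_delay(unsigned int ms)
{
	(void)ms;
}

/*
 * Keep polling until the read succeeds or timeout_ms has been spent.
 * step_ms must be non-zero, otherwise the budget is never consumed.
 */
static bool sketch_get_presence(struct sketch_occ *occ, uint32_t *presence,
				unsigned int timeout_ms, unsigned int step_ms,
				void (*delay_ms)(unsigned int))
{
	unsigned int waited = 0;

	for (;;) {
		if (sketch_occ_read(occ, presence))
			return true;
		if (waited >= timeout_ms)
			return false;
		delay_ms(step_ms);
		waited += step_ms;
	}
}
```

Passing the delay function in as a parameter is only there to keep the example testable without sleeping; a call such as sketch_get_presence(&occ, &bits, 2000, 100, sketch_no_delay) models the 2 second budget from the commit message.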
| |
kill_type is an enum of OPAL_PCI_TCE_KILL_PAGES, OPAL_PCI_TCE_KILL_PE,
OPAL_PCI_TCE_KILL_ALL and phb4_tce_kill() gets it right but
npu2_tce_kill() uses OPAL_PCI_TCE_KILL which is an OPAL API token.
This fixes an obvious typo.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
The RX_RC_ENABLE_AUTO_RECAL flag is required on OpenCAPI but not NVLink.
Traditionally, Hostboot sets this value according to the machine type.
However, now that Witherspoon supports both NVLink and OpenCAPI, it can't
tell whether or not a link is OpenCAPI.
So instead, set it in skiboot, where it will only be triggered after we've
done device detection and found an OpenCAPI device.
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com>
Acked-by: Reza Arbab <arbab@linux.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
In opal_npu_tl_set(), we made a typo that means the OPAL_NPU_TL_SET call
may not clear the enable bits for templates that were previously enabled
but are now disabled.
Fix the typo so we clear NPU2_OTL_CONFIG1_TX_TEMP2_EN as well as
TEMP{1,3}_EN.
Reported-by: Tyler Seredynski <tseredynski@gmail.com>
Fixes: cd8b82a8e83ed ("npu2-opencapi: Add OpenCAPI OPAL API calls")
Cc: stable
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
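The class of bug fixed in the commit above is a clear mask that misses one of the bits it is supposed to cover. A sketch of the correct clear-then-set pattern follows; the bit positions and names are invented for the example and do not match the real NPU2_OTL_CONFIG1 layout.

```c
#include <stdint.h>

/* Hypothetical bit positions; the real register layout differs. */
#define SKETCH_TX_TEMP1_EN	(1ULL << 0)
#define SKETCH_TX_TEMP2_EN	(1ULL << 1)
#define SKETCH_TX_TEMP3_EN	(1ULL << 2)
#define SKETCH_TX_TEMP_ALL	(SKETCH_TX_TEMP1_EN | SKETCH_TX_TEMP2_EN | \
				 SKETCH_TX_TEMP3_EN)

/*
 * Clear every template enable bit first, then set only the requested
 * ones, so a template that was enabled earlier cannot stay enabled by
 * accident. Bits outside the template field are left untouched.
 */
static uint64_t sketch_set_templates(uint64_t reg, int t1, int t2, int t3)
{
	reg &= ~SKETCH_TX_TEMP_ALL;
	if (t1)
		reg |= SKETCH_TX_TEMP1_EN;
	if (t2)
		reg |= SKETCH_TX_TEMP2_EN;
	if (t3)
		reg |= SKETCH_TX_TEMP3_EN;
	return reg;
}
```

Building the clear mask from one combined definition, instead of listing the bits again at the call site, is one way to make this kind of typo harder to write.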
| |
Seeing as all the VERSION signing code is taking way too long to get
upstream, let's temporarily skip verifying VERSION for now.
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
| |
It's wrong!
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>