Commit message | Author | Age | Files | Lines
...
* libflash: Add ipmi-hiomapAndrew Jeffery2018-10-117-18/+972
| | | | | | | | | | | | | | | | | | | | | | ipmi-hiomap implements the PNOR access control protocol formerly known as "the mbox protocol" but uses IPMI instead of the AST LPC mailbox as a transport. As there is no longer any mailbox involved in this alternate implementation the old protocol name is quite misleading, and so it has been renamed to "the hiomap protocol" (Host I/O Mapping protocol). The same commands and events are used though this client-side implementation assumes v2 of the protocol is supported by the BMC. The code is a heavily-reworked copy of the mbox-flash source and is introduced this way to allow for the mbox implementation's eventual removal. mbox-flash should in theory be renamed to mbox-hiomap for consistency, but as it is on life-support effective immediately we may as well just remove it entirely when the time is right. Signed-off-by: Andrew Jeffery <andrew@aj.id.au> [stewart: prlog debug over prerror for mbox fallback, fix indent] Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* core/flash: Unlock around blocklevel calls in NVRAM accessorsAndrew Jeffery2018-10-101-0/+11
| | | | | | | This ensures progress when we don't have interrupts available for IPMI. Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* core/flash: Only lock around flashes update in flash_register()Andrew Jeffery2018-10-101-6/+2
| | | | | | | | | | | | Previously flash_register() held flash_lock across ffs_init(), which calls through the blocklevel layer to read the flash. This is unhelpful with the IPMI HIOMAP protocol transport as LPC interrupts have not yet been enabled and we are relying on polling to progress. The held lock stalls the boot as we take the nopoll path in time_wait() while completing ipmi_queue_msg_sync() in libflash/ipmi-flash.c. Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* core/lock: Use try_lock_caller() in lock_caller() to capture ownerAndrew Jeffery2018-10-101-1/+1
| | | | | | | | Otherwise we can get reports of core/lock.c owning the lock, which is not helpful when tracking down ownership issues. Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* ipmi: Introduce registration for SEL command handlersAndrew Jeffery2018-10-102-29/+94
| | | | | Signed-off-by: Andrew Jeffery <andrew@aj.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* core/lock: don't set bust_locks on lock errorNicholas Piggin2018-10-101-2/+0
| | | | | | | | | | | | | | | | | | | | | | bust_locks is a big hammer that guarantees a mess if it's set while all other threads are not stopped. I propose removing this in the lock error paths. In debugging the previous deadlock false positive, none of the error messages printed, and the in-memory console was totally garbled due to lack of locking. I think it's generally better for debugging and system integrity to keep locks held when lock errors occur. Lock busting should be used carefully, just to allow messages to be printed out or machine to be restarted, probably when the whole system is single-threaded. Skiboot is slowly working toward that being feasible with co-operative debug APIs between firmware and host, but for the time being, difficult lock crashes are better not to corrupt everything by busting locks. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* core/lock: fix timeout warning causing a deadlock false positiveNicholas Piggin2018-10-101-6/+15
| | | | | | | | | | | | | | | If a lock waiter exceeds the warning timeout, it prints a message while still registered as requesting the lock. Printing the message can take locks, so if one is held when the owner of the original lock tries to print a message, it will get a false positive deadlock detection, which brings down the system. This can easily be hit when there is a lot of HMI activity from a KVM guest, where the timebase was not returned to host timebase before calling the HMI handler. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* phb4: Generate checkstop on AIB ECC corr/uncorr for DD2.0 partsMichael Neuling2018-09-271-9/+34
| | | | | | | | | | | | | On DD2.0 parts, PCIe ECC protection is not warranted in the response data path. Thus, for these parts, we need to flag any ECC errors detected from the adjacent AIB RX Data path so the part can be replaced. This patch configures the FIRs so that we escalate these AIB ECC errors to a checkstop so the parts can be replaced. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* init: Fix starting stripped kernelMichael Neuling2018-09-271-0/+1
| | | | | | | | | | | | | | | | | | | | Currently if we try to run a raw/stripped binary kernel (ie. without the elf header) we crash with: [ 0.008757768,5] INIT: Waiting for kernel... [ 0.008762937,5] INIT: platform wait for kernel load failed [ 0.008768171,5] INIT: Assuming kernel at 0x20000000 [ 0.008779241,3] INIT: ELF header not found. Assuming raw binary. [ 0.017047348,5] INIT: Starting kernel at 0x0, fdt at 0x3044b230 14339 bytes [ 0.017054251,0] FATAL: Kernel is zeros, can't execute! [ 0.017059054,0] Assert fail: core/init.c:590:0 [ 0.017065371,0] Aborting! This is because we haven't set kernel_entry correctly in this path. This fixes it. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Handle early HMIs on thread0 when secondaries are still in OPAL.Mahesh Salgaonkar2018-09-271-0/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When primary thread receives a CORE level HMI for timer facility errors while secondaries are still in OPAL, thread 0 ends up in rendez-vous waiting for secondaries to get into hmi handling. This is because OPAL runs with MSR(EE=0) and hence HMIs are delayed on secondary threads until they are given to Linux OS. Fix this by adding a check for secondary state and force them in hmi handling by queuing job on secondary threads. I have tested this by injecting HDEC parity error very early during Linux kernel boot. Recovery works fine for non-TB errors. But if TB is bad at this very early stage we are already doomed. Without this patch we see: [ 285.046347408,7] OPAL: Start CPU 0x0843 (PIR 0x0843) -> 0x000000000000a83c [ 285.051160609,7] OPAL: Start CPU 0x0844 (PIR 0x0844) -> 0x000000000000a83c [ 285.055359021,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 [ 285.055361439,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:0: TFMR(2e12002870e14000) Timer Facility Error [ 286.232183823,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc1) [ 287.409002056,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc1) [ 289.073820164,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc1) [ 290.250638683,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc2) [ 291.427456821,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc2) [ 293.092274807,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc2) [ 294.269092904,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 1 (sptr=0000ccc3) [ 295.445910944,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 2 (sptr=0000ccc3) [ 297.110728970,3] HMI: Rendez-vous stage 1 timeout, CPU 0x844 waiting for thread 3 (sptr=0000ccc3) After
this patch: [ 259.401719351,7] OPAL: Start CPU 0x0841 (PIR 0x0841) -> 0x000000000000a83c [ 259.406259572,7] OPAL: Start CPU 0x0842 (PIR 0x0842) -> 0x000000000000a83c [ 259.410615534,7] OPAL: Start CPU 0x0843 (PIR 0x0843) -> 0x000000000000a83c [ 259.415444519,7] OPAL: Start CPU 0x0844 (PIR 0x0844) -> 0x000000000000a83c [ 259.419641401,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 [ 259.419644124,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:0: TFMR(2e12002870e04000) Timer Facility Error [ 259.419650678,7] HMI: Sending hmi job to thread 1 [ 259.419652744,7] HMI: Sending hmi job to thread 2 [ 259.419653051,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 [ 259.419654725,7] HMI: Sending hmi job to thread 3 [ 259.419654916,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 [ 259.419658025,7] HMI: Received HMI interrupt: HMER = 0x0840000000000000 [ 259.419658406,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:2: TFMR(2e12002870e04000) Timer Facility Error [ 259.419663095,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:3: TFMR(2e12002870e04000) Timer Facility Error [ 259.419655234,7] HMI: [Loc: U78D3.ND1.WZS004A-P1-C48]: P:8 C:17 T:1: TFMR(2e12002870e04000) Timer Facility Error [ 259.425109779,7] OPAL: Start CPU 0x0845 (PIR 0x0845) -> 0x000000000000a83c [ 259.429870681,7] OPAL: Start CPU 0x0846 (PIR 0x0846) -> 0x000000000000a83c [ 259.434549250,7] OPAL: Start CPU 0x0847 (PIR 0x0847) -> 0x000000000000a83c Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* witherspoon: Rename shared slot fixup functionOliver O'Halloran2018-09-271-7/+2
| | | | | | | | Rename and set it as a pre_pci_fixup platform function. The indirect call doesn't make a whole lot of sense IMO. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* hdata: Explain what the xscom node bus-frequency isOliver O'Halloran2018-09-271-1/+4
| | | | | | | Vague documentation is about as annoying as no documentation. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* hw/p8-i2c: Fix i2c request timeoutFrederic Barrat2018-09-272-7/+5
| | | | | | | | | | | | | | | | Commit eb146fac9685 ("core/i2c: Move the timeout field into i2c_request") simplified a bit how a request timeout is handled. However there's now some confusion between milliseconds and timebase increments when defining or using the timeout values, which breaks i2c requests made for opencapi, and probably others too. This patch declares all the timeouts in milliseconds and just converts to timebase at the end of the chain, as needed. Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Tested-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
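The "milliseconds everywhere, timebase only at the end of the chain" rule described above can be sketched roughly as follows. This is an illustration only, not the patch itself: every `sketch_` name is hypothetical, and the 512 MHz timebase frequency is an assumption made for the example.

```c
#include <stdint.h>

/* Assumption for this sketch: a 512 MHz timebase, i.e. 512000 ticks per ms. */
#define SKETCH_TB_HZ 512000000ULL

/* Hypothetical request structure: the timeout is always in milliseconds. */
struct sketch_i2c_request {
	uint64_t timeout_ms;
};

/* Convert milliseconds to timebase ticks. */
static inline uint64_t sketch_msecs_to_tb(uint64_t ms)
{
	return ms * (SKETCH_TB_HZ / 1000);
}

/*
 * Convert to timebase only at the point where the timer is armed, so that
 * no intermediate layer can confuse the two units.
 */
static inline uint64_t sketch_deadline(uint64_t now_tb,
				       const struct sketch_i2c_request *req)
{
	return now_tb + sketch_msecs_to_tb(req->timeout_ms);
}
```

Keeping a single unit in the structure and naming the field after that unit makes a mixed-unit bug of the kind described in the message harder to reintroduce.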
* opal-ci: Build old dtc version for fedora 28Stewart Smith2018-09-202-1/+34
| | | | | | | There are patches that will go into dtc to fix the issues we hit, but for the moment let's just build and use a slightly older version. Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* FSP: Improve Reset/Reload log messageVasant Hegde2018-09-201-2/+2
| | | | | | | | | | | | | | Below message is confusing. Let's make it clear. FSP sends "R/R complete notification" whenever there is a dump. We use `flag` to identify whether it's an R/R completion -OR- just a new dump notification. [ 483.406351956,6] FSP: SP says Reset/Reload complete [ 483.406354278,5] DUMP: FipS dump available. ID = 0x1a00001f [size: 6367640 bytes] [ 483.406355968,7] A Reset/Reload was NOT done Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* mambo: Merge PMEM_DISK and PMEM_VOLATILE codeMichael Neuling2018-09-201-46/+31
| | | | | | | | | | PMEM_VOLATILE and PMEM_DISK can't be used together and are basically copies of the same code. This merges the two and allows them to be used together. Same API is kept. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* phb4: Re-factor phb4_fenced() and introduce phb4_dump_pec_err_regs()Vaibhav Jain2018-09-201-30/+38
| | | | | | | | | | | There are a couple of places in 'phb4.c' where we may want to dump the PEC's error registers. Hence we introduce a phb4_dump_pec_err_regs() that dumps all the PEC error registers and also updates phb4->nfir_cache & phb4->pfir_cache for later use. Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Reviewed-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* phb4: Reset pfir and nfir if new errors reported during ETU resetVaibhav Jain2018-09-201-0/+19
| | | | | | | | | | | | | | | | | | | | | | During fast-reboot new PEC errors can be latched even after ETU-Reset is asserted. This will cause the values of the variables nfir_cache and pfir_cache to be out of sync. During step-2 of CRESET nfir_cache and pfir_cache values are used to bring the PHB out of reset state. However, if these variables are out of date as noted above, the nfir/pfir registers are never reset completely and the ETU still remains frozen. Hence this patch updates step-2 of phb4_creset to re-read the values of nfir/pfir registers to check if any new errors were reported after ETU-reset was asserted, report these new errors and reset the nfir/pfir registers. This should bring the ETU out of reset successfully. Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Tested-By: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Reviewed-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* xscom-utils: Rework getsramOliver O'Halloran2018-09-201-7/+47
| | | | | | | | | Allow specifying a file on the command line to read OCC SRAM data into. If no file is specified then we print it to stdout as text. This is a bit inconsistent, but it retains compatibility with the existing tool. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* xscom-utils/getsram: Make it work on P9Oliver O'Halloran2018-09-201-23/+58
| | | | | | | | | The XSCOM base address of the OCC control registers changed slightly between P8 and P9. Fix this up and add a bit of PVR checking so we look in the right place. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* fast-reboot: verify firmware "romem" checksumNicholas Piggin2018-09-203-0/+55
| | | | | | | | | | | | | | | | This takes a checksum of skiboot memory after boot that should be unchanged during OS operation, and verifies it before allowing a fast reboot. This is not read-only memory from skiboot's point of view, because it includes things like the opal branch table that gets populated during boot. This helps to improve the integrity of firmware against host and runtime firmware memory scribble bugs. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
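The record-then-verify idea in the message above might look roughly like the sketch below. The commit message does not say which checksum is used, so a plain byte sum stands in for it here; all `sketch_` names are hypothetical and the region is passed in explicitly rather than taken from linker symbols.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in checksum (assumption): a 64-bit sum of all bytes in the region. */
static uint64_t sketch_checksum(const uint8_t *start, size_t len)
{
	uint64_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += start[i];
	return sum;
}

/* Checksum recorded once boot has finished populating the region. */
static uint64_t sketch_boot_csum;

static void sketch_record_csum(const uint8_t *start, size_t len)
{
	sketch_boot_csum = sketch_checksum(start, len);
}

/* Called before a fast reboot: refuse if the region has been scribbled on. */
static bool sketch_verify_csum(const uint8_t *start, size_t len)
{
	return sketch_checksum(start, len) == sketch_boot_csum;
}
```

A byte sum misses some corruptions (for example two swapped bytes), so a stronger function would be the natural choice in practice; the control flow is the point of the sketch.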
* skiboot.lds.S: move read-write data after the end of symbol mapNicholas Piggin2018-09-203-19/+31
| | | | | | | | This also tidies up linker script symbol declarations and adds _rodata_mem symbol for the next change to use. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* core/mem_region: mambo reserve kernel payload areasNicholas Piggin2018-09-2011-3/+40
| | | | | | | | | | Mambo image payloads get overwritten by the OS and by fast reboot memory clearing because they have no region defined. Add them, which allows fast reboot to work. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [stewart: fix up 'make check'] Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* core/fast-reboot: print the fast reboot disable reasonNicholas Piggin2018-09-191-5/+7
| | | | | | | | | | Once things start to go wrong, disable_fast_reboot can be called a number of times, so make the first reason sticky, and also print it to the console at disable time. This helps with making sense of fast reboot disables. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
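A "first reason is sticky" disable function, as described above, could be sketched like this. The names and the message format are hypothetical and chosen for the example; only the behaviour (keep the first reason, print it when the disable happens) comes from the commit message.

```c
#include <stddef.h>
#include <stdio.h>

/* NULL while fast reboot is still allowed (hypothetical state variable). */
static const char *sketch_disable_reason;

static void sketch_disable_fast_reboot(const char *reason)
{
	/* Later callers must not overwrite the original cause. */
	if (sketch_disable_reason)
		return;

	/* Print at disable time so the console shows what went wrong first. */
	fprintf(stderr, "fast reboot disabled: %s\n", reason);
	sketch_disable_reason = reason;
}
```

Calling it twice with different reasons leaves the first one recorded, which is what makes a later "why was fast reboot refused" report useful.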
* Actually add /ibm,opal/fast-reboot propertyStewart Smith2018-09-181-0/+11
| | | | | | | | I missed a hunk when merging :( Reported-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Fixes: 7c8e1c6f89f3aac77661cfcee75ab515bd053d75 Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* Add fast-reboot property to /ibm,opal DT nodeStewart Smith2018-09-183-0/+13
| | | | | | | this means that if it's permanently disabled on boot, the test suite can pick that up and not try a fast reboot test. Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* nvram: Fix wait-for-nvram messageOliver O'Halloran2018-09-171-2/+3
| | | | | | | | | | | We print a message when nvram_query() needs to wait for the NVRAM to be loaded from the BMC/FSP. Currently this is printed at PR_WARNING which is excessive since this doesn't actually indicate that anything is wrong. There's also nothing that we can really do about loading the NVRAM being slow, so just print this at PR_DEBUG. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* nvram: Print how long we waited for nvramOliver O'Halloran2018-09-172-0/+11
| | | | | | | | Print how long we had to wait for NVRAM to become available if we needed to wait. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* nvram: Fix a possible NULL pointer de-ref in nvram_query_eq()Vaibhav Jain2018-09-171-0/+9
| | | | | | | | | | | | | | A fault will occur if 'value == NULL' is passed to nvram_query_eq() to check if a given key doesn't exist in nvram partition. This is an invalid use of the API as it's only supposed to be used for keys that exist in nvram and 'value == NULL' is never possible. Hence this patch adds an assert to the function to flag such a use and also prevent NULL being passed as an argument to strcmp(). Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Suggested-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
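The guard described above can be illustrated with a small stand-alone sketch. The lookup function and the key names are hypothetical stand-ins, not the real nvram code; the only part taken from the message is asserting on a NULL `value` before it can reach strcmp().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical lookup: returns the stored value for a key, or NULL. */
static const char *sketch_nvram_query(const char *key)
{
	if (strcmp(key, "example-key") == 0)
		return "1";
	return NULL;
}

static bool sketch_nvram_query_eq(const char *key, const char *value)
{
	const char *stored = sketch_nvram_query(key);

	/* Comparing against NULL is an API misuse: flag it before strcmp(). */
	assert(value != NULL);

	if (!stored)
		return false;
	return strcmp(stored, value) == 0;
}
```

A caller that wants to know whether a key is absent would check the plain query result for NULL instead of passing NULL to the comparison helper.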
* errorlog: Rename PHB3 to just PHBRussell Currey2018-09-171-2/+2
| | | | | | | | | | | | | I don't see a reason why there would need to be a PHB3 *specific* subsystem in the error logs, so rename it to PHB so that PHB4 and later can use it too without continually redefining it. This shouldn't change any existing assumptions because it's unused. Signed-off-by: Russell Currey <ruscur@russell.cc> Reviewed-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* SBE-p8: Do all sbe timer update with xscom lock heldStewart Smith2018-09-171-3/+4
| | | | | | | | | | | | | | Without this, on some P8 platforms, we could (falsely) think the SBE timer had stalled getting the dreaded "timer stuck" message. The code was doing the mftb() to set the start of the timeout period while *not* holding the lock, so the 1ms timeout started sometime when somebody else had the xscom lock. The simple solution is to just do the whole routine holding the xscom lock, so do it that way. Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* phb4: Fix typo in disable lane eq codeMichael Neuling2018-09-171-1/+1
| | | | | | | | | | | | | | | | | In this commit: commit 737c0ba3d72b8aab05a765a9fc111a48faac0f75 Author: Michael Neuling <mikey@neuling.org> Date: Thu Feb 22 10:52:18 2018 +1100 phb4: Disable lane eq when retrying some nvidia GEN3 devices We made a typo and set PH2 twice. This fixes it. It worked previously because, if only phase 2 (PH2) is set, it skips both phase 2 and phase 3 (PH3). Reported-by: Meng Li <shlimeng@cn.ibm.com> Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* core/i2c: Remove bus specific alloc and free callbacksOliver O'Halloran2018-09-174-31/+10
| | | | | | | These are now pointless and they can be replaced with zalloc() and free(). Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* hw/p8-i2c: Remove p8_i2c_request structureOliver O'Halloran2018-09-171-31/+3
| | | | | | | | | The p8_i2c_request structure is barely used and the only useful data it contains (port_num) can be derived from the bus pointer. Remove it in preparation for removing the per-bus allocation and free methods. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* core/i2c: Move the timeout field into i2c_requestOliver O'Halloran2018-09-174-46/+19
| | | | | | | | | | | Currently to set a per-request timeout you need to use i2c_req_set_timeout() which is a wrapper for a per-bus method that sets the actual timeout. This design doesn't make a whole lot of sense, so move the timeout field into the generic i2c_request structure and set the timeout using that. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Ignore debug trigger inject core FIR.Mahesh Salgaonkar2018-09-171-1/+0
| | | | | | | | | | | | | Core FIR[60] is a side effect of the work around for the CI Vector Load issue in DD2.1. Usually this gets delivered as HMI with HMER[17] where Linux already ignores it. But it looks like in some cases we may happen to see CORE_FIR[60] while we are already in Malfunction Alert HMI (HMER[0]) due to other reasons e.g. CAPI recovery or NPU xstop. If that happens then just ignore it instead of crashing kernel as not recoverable. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal: use for_each_safe to iterate over opal_syncersVaibhav Jain2018-09-171-2/+2
| | | | | | | | | | | | | | | | | | | | | Presently a fault will happen in opal_sync_host_reboot if a callback tries to remove itself from the opal_syncers list by calling opal_del_host_sync_notifier. This happens as iteration over opal_syncers is done using the list_for_each() which doesn't preserve list_node->next. So when the current opal_syncers callback removes itself from the list, current node contents are lost and current_node->next pointer is rendered invalid. To fix this we simply switch from list_for_each() to list_for_each_safe() which keeps the current_node->next cached hence even if the current node is freed, iteration over subsequent nodes can still continue. Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com> Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
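The difference between the two iteration styles comes down to when the next pointer is read. The sketch below shows the "safe" pattern on a hand-rolled singly linked list; it does not use the project's list macros, and every `sketch_` name is hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical notifier entry; remove_self models a callback that
 * unregisters itself while the list is being walked. */
struct sketch_syncer {
	struct sketch_syncer *next;
	int remove_self;
};

/* Push a new entry on the front of the list and return the new head. */
static struct sketch_syncer *sketch_add(struct sketch_syncer *head,
					int remove_self)
{
	struct sketch_syncer *s = malloc(sizeof(*s));

	if (!s)
		abort();
	s->next = head;
	s->remove_self = remove_self;
	return s;
}

/* Visit every entry, freeing those that ask to be removed.
 * Returns the number of entries visited. */
static int sketch_run_syncers(struct sketch_syncer **head)
{
	struct sketch_syncer **link = head;
	struct sketch_syncer *cur, *next;
	int visited = 0;

	for (cur = *head; cur; cur = next) {
		next = cur->next;	/* cached before cur can be freed */
		visited++;
		if (cur->remove_self) {
			*link = next;	/* unlink, then free */
			free(cur);
		} else {
			link = &cur->next;
		}
	}
	return visited;
}
```

Writing the loop step as `cur = cur->next` instead would read from memory that has already been freed whenever an entry removes itself, which is the fault the message describes.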
* platforms/astbmc/witherspoon: Implement OpenCAPI supportAndrew Donnellan2018-09-171-3/+202
| | | | | | | | | | | | | | | | | | | | | | OpenCAPI on Witherspoon is slightly more involved than on Zaius and ZZ, due to the OpenCAPI links using the SXM2 connectors that are used for NVLink GPUs. This patch adds the regular OpenCAPI platform information, and also a Witherspoon-specific presence detection callback that uses the previously added OCC GPU presence detection to figure out the device types plugged into each SXM2 socket. The SXM2 connectors are capable of carrying 2 OpenCAPI links, and future OpenCAPI devices are expected to make use of this. However, we don't yet support ganged links and the various implications that has for handling things like device reset, so for now, we only enable 1 brick per device. Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Reza Arbab <arbab@linux.ibm.com> Reviewed-by: Alistair Popple <alistair@popple.id.au> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* hw/npu2, platform: Restructure OpenCAPI i2c reset/presence pinsAndrew Donnellan2018-09-176-38/+64
| | | | | | | | | | | | | | | | | | | | | | In platform_ocapi, we define i2c_{reset,presence}_odl{0,1} to specify the appropriate reset/presence GPIO pins for devices connected to ODL0 and ODL1 respectively. This is obviously wrong, because a device connected to brick 2 and a device connected to brick 4 are going to be different devices connected to different I2C pins, but rather conveniently we haven't had to deal with systems that can use the full 4 bricks as yet. Now that we're adding OpenCAPI support for Witherspoon, we should change this to specify pins separately for all 4 bricks. Replace i2c_{reset,presence}_odl{0,1} with i2c_{reset,presence}_brick{2,3,4,5} and update the presence detection code, device reset code, and existing platforms accordingly. Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Reviewed-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* hw/npu2, platform: Add NPU2 platform device detection callbackAndrew Donnellan2018-09-179-97/+144
| | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no standardised way to determine the presence and type of devices connected to an NPU on POWER9. Currently, we hardcode device types based on platform type (as no platform currently supports both OpenCAPI and NVLink), and for OpenCAPI platforms we use I2C to detect presence. Witherspoon (and potentially other platforms later on) supports both NVLink and OpenCAPI, and additionally uses SXM2 connectors which can carry more than one link, rather than the SlimSAS connectors used for OpenCAPI on Zaius and ZZ. This necessitates some special handling. Add a platform callback for NPU device detection. In a later patch, we will use this to implement Witherspoon-specific device detection. For now, add a Witherspoon stub that sets all links to NVLink (i.e. current behaviour). Move the existing I2C-based presence detection for OpenCAPI devices on Zaius/ZZ into common code, which we use by default for platforms which do not define a callback. Clean up the use of the ibm,npu-link-type property, which will now be exposed solely for debugging and not consumed internally. Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* hw/npu2: Common NPU2 init routine between NVLink and OpenCAPIAndrew Donnellan2018-09-176-271/+250
| | | | | | | | | | | | | | | Replace probe_npu2() and probe_npu2_opencapi() with a new shared probe_npu2(). Refactor some of the common NPU setup code into shared code. No functional change. This patch does not implement support for using both types of devices simultaneously on the same NPU - we expect to add this sometime in the future. Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Reza Arbab <arbab@linux.ibm.com> Reviewed-by: Alistair Popple <alistair@popple.id.au> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* npu2: Split device index into brick and link indexAndrew Donnellan2018-09-175-60/+67
| | | | | | | | | | | | | | On Witherspoon, OpenCAPI devices attached to link indexes 0 and 1 are handled by bricks 2 and 3. Rename index to brick_index, and add a new field, link_index, to refer to the link index. For now, we set those values identically. Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Acked-by: Reza Arbab <arbab@linux.ibm.com> Reviewed-by: Alistair Popple <alistair@popple.id.au> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* occ: Wait if OCC GPU presence status not immediately availableAndrew Donnellan2018-09-171-3/+13
| | | | | | | | | | | | | | | | It takes a few seconds for the OCC to set everything up in order to read GPU presence. At present, we try to kick off OCC initialisation as early as possible to maximise the time it has to read GPU presence. Unfortunately sometimes that's not enough, so add a loop in occ_get_gpu_presence() so that on the first time we try to get GPU presence we keep trying for up to 2 seconds. Experimentally this seems to be adequate. Fixes: 9b394a32c8ea ("occ: Add support for GPU presence detection") Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* npu2: Use correct kill type for TCE invalidationAlexey Kardashevskiy2018-09-171-1/+1
| | | | | | | | | | | kill_type is enum of OPAL_PCI_TCE_KILL_PAGES, OPAL_PCI_TCE_KILL_PE, OPAL_PCI_TCE_KILL_ALL and phb4_tce_kill() gets it right but npu2_tce_kill() uses OPAL_PCI_TCE_KILL which is an OPAL API token. This fixes an obvious mistype. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* hw/npu2-hw-procedures: Enable RX auto recal on OpenCAPI linksAndrew Donnellan2018-09-171-0/+8
| | | | | | | | | | | | | | | | The RX_RC_ENABLE_AUTO_RECAL flag is required on OpenCAPI but not NVLink. Traditionally, Hostboot sets this value according to the machine type. However, now that Witherspoon supports both NVLink and OpenCAPI, it can't tell whether or not a link is OpenCAPI. So instead, set it in skiboot, where it will only be triggered after we've done device detection and found an OpenCAPI device. Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Acked-by: Reza Arbab <arbab@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* hw/npu2-opencapi: Fix setting of supported OpenCAPI templatesAndrew Donnellan2018-09-171-2/+2
| | | | | | | | | | | | | | | | In opal_npu_tl_set(), we made a typo that means the OPAL_NPU_TL_SET call may not clear the enable bits for templates that were previously enabled but are now disabled. Fix the typo so we clear NPU2_OTL_CONFIG1_TX_TEMP2_EN as well as TEMP{1,3}_EN. Reported-by: Tyler Seredynski <tseredynski@gmail.com> Fixes: cd8b82a8e83ed ("npu2-opencapi: Add OpenCAPI OPAL API calls") Cc: stable Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Reviewed-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
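The clear-then-set pattern that the fix above restores can be sketched as follows. The bit positions are made up for the example and do not match the real NPU2_OTL_CONFIG1_TX_TEMP*_EN definitions; all `SKETCH_`/`sketch_` names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical bit positions, chosen only for illustration. */
#define SKETCH_TX_TEMP1_EN	(1ULL << 0)
#define SKETCH_TX_TEMP2_EN	(1ULL << 1)
#define SKETCH_TX_TEMP3_EN	(1ULL << 2)
#define SKETCH_TX_TEMP_ALL	(SKETCH_TX_TEMP1_EN | SKETCH_TX_TEMP2_EN | \
				 SKETCH_TX_TEMP3_EN)

/*
 * Clear every template enable bit first, then set only the requested ones,
 * so a template that was enabled earlier cannot stay enabled by accident.
 * Unrelated bits in the register are left untouched.
 */
static uint64_t sketch_set_templates(uint64_t reg, uint64_t enable)
{
	reg &= ~SKETCH_TX_TEMP_ALL;
	reg |= enable & SKETCH_TX_TEMP_ALL;
	return reg;
}
```

Building the clear mask from a single combined constant, rather than listing the bits at each call site, is one way to make it harder to repeat one bit and omit another.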
* TEMPORARY HACK: Disable verifying VERSIONStewart Smith2018-09-131-1/+6
| | | | | | | Seeing as all the VERSION signing code is taking way too long to get upstream, let's temporarily skip verifying VERSION for now. Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* xscom-utils/adu_scoms.py: run 2to3 over itStewart Smith2018-09-131-21/+21
| | | | Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* clang: -Wno-error=ignored-attributesStewart Smith2018-09-131-1/+2
| | | | Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* hdata/iohub: Fix Cumulus Hub ID numberOliver O'Halloran2018-09-131-1/+1
| | | | | | | It's wrong! Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com>