Commit message (Collapse)AuthorAgeFilesLines
* Add new LPC-based devmem-aspeed utility04-19-2018Timothy Pearson2018-04-202-2/+228
|
* external: Add "lpc" toolBenjamin Herrenschmidt2018-04-192-0/+193
| | | | | | | This is a little front-end to the lpc debugfs files to access the LPC bus from userspace on the host. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* Mark all partitions except full PNOR and boot kernel firmware read onlyRaptor Engineering Development Team2018-04-191-0/+7
|
* Expose PNOR Flash partitions to host MTD driver via devicetreeRaptor Engineering Development Team2018-04-192-12/+65
|
* Signal skiboot completion to BMC when doneRaptor Engineering Development Team2018-04-192-0/+11
|
* Copy and convert Romulus descriptors to TalosRaptor Engineering Development Team2018-04-192-1/+58
|
* travis-ci: pull Mambo over http rather than ftpStewart Smith2018-04-194-4/+4
| | | | Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* external/trace: fix makefileStewart Smith2018-04-181-1/+1
| | | | Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* core/test/run-trace: fix on ppc64elStewart Smith2018-04-181-1/+2
| | | | | | | Hackish fix from benh Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* core/fast-reboot: Increase timeout for dctl sreset to 1secVaidyanathan Srinivasan2018-04-181-1/+1
| | | | | | | | | | | | Direct control xscom can take more time to complete. We seem to wait too little on Boston failing fast-reboot for no good reason. Increase timeout to 1 sec as a reasonable value for sreset to be delivered and core to start executing instructions. Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* core: Fix iteration condition to skip garded cpuVaidyanathan Srinivasan2018-04-181-1/+1
| | | | | | | | | Fix the logic error in the loop that iterated incorrectly over garded cpu. Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* core/opal: Allow poller re-entry if OPAL was re-enteredNicholas Piggin2018-04-181-4/+8
| | | | | | | | | | | | | | | If an NMI interrupts the middle of running pollers and the OS invokes pollers again (e.g., for console output), the poller re-entrancy check will prevent it from running and spam the console. That check was designed to catch a poller calling opal_run_pollers, OPAL re-entrancy is something different and is detected elsewhere. Avoid the poller recursion check if OPAL has been re-entered. This is a best-effort attempt to cope with errors. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* core/opal: Emergency stack for re-entryNicholas Piggin2018-04-186-16/+52
| | | | | | | | | | | | | | | | | | | | This detects OPAL being re-entered by the OS, and switches to an emergency stack if it was. This protects the firmware's main stack from re-entrancy and allows the OS to use NMI facilities for crash / debug functionality. Further nested re-entry will destroy the previous emergency stack and prevent returning, but those should be rare cases. This stack is sized at 16kB, which doubles the size of CPU stacks, so as not to introduce a regression in primary stack size. The 16kB stack originally had a 4kB machine check stack at the top, which was removed by 80eee1946 ("opal: Remove machine check interrupt patching in OPAL."). So it is possible the size could be tightened again, but that would require further analysis. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* asm/head: implement quiescing without stack or clobbering regsNicholas Piggin2018-04-184-34/+83
| | | | | | | | | | | | | | | | | | | | | | | | Quiescing currently is implemented in C in opal_entry before the opal call handler is called. This works well enough for simple cases like fast reset when one CPU wants all others out of the way. Linux would like to use it to prevent an sreset IPI from interrupting firmware, which could lead to deadlocks when crash dumping or entering the debugger. Linux interrupts do not recover well when returning back to general OPAL code, due to r13 not being restored. OPAL also can't be re-entered, which may happen e.g., from the debugger. So move the quiesce hold/reject to entry code, before the stack or r1 or r13 registers are switched. OPAL can be interrupted and returned to or re-entered during this period. This does not completely solve all such problems. OPAL will be interrupted with sreset if the quiesce times out, and it can be interrupted by MCEs as well. These still have the issues above. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* core/stack: backtrace unwind basic OPAL call detailsNicholas Piggin2018-04-183-11/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Put OPAL callers' r1 into the stack back chain, and then use that to unwind back to the OPAL entry frame (as opposed to boot entry, which has a 0 back chain). From there, dump the OPAL call token and the caller's r1. A backtrace looks like this:

CPU 0000 Backtrace:
 S: 0000000031c03ba0 R: 000000003001a548 ._abort+0x4c
 S: 0000000031c03c20 R: 000000003001baac .opal_run_pollers+0x3c
 S: 0000000031c03ca0 R: 000000003001bcbc .opal_poll_events+0xc4
 S: 0000000031c03d20 R: 00000000300051dc opal_entry+0x12c
 --- OPAL call entry token: 0xa caller R1: 0xc0000000006d3b90 ---

This is pretty basic for the moment, but it does give you the bottom of the Linux stack. It will allow some interesting improvements in future.

First, with the eframe, all the call's parameters can be printed out as well. The ___backtrace / ___print_backtrace API needs to be reworked in order to support this, but it's otherwise very simple (see opal_trace_entry()).

Second, it will allow Linux's stack to be passed back to Linux via a debugging opal call. This will allow Linux's BUG() or xmon to also print the Linux back trace in case of a NMI or MCE or watchdog lockup that hits in OPAL.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Add documentation for opal_handle_hmi2 callMahesh Salgaonkar2018-04-171-0/+126
| | | | | Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Generate hmi event for recovered HDEC parity error.Mahesh Salgaonkar2018-04-173-8/+10
| | | | | Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: check thread 0 tfmr to validate latched tfmr errors.Mahesh Salgaonkar2018-04-172-19/+50
| | | | | | | | | | | Due to P9 errata, HDEC parity and TB residue errors are latched for non-zero threads 1-3 even if they are cleared. But these are not latched on thread 0. Hence, use xscom SCOMC/SCOMD to read thread 0 tfmr value and ignore them on non-zero threads if they are not present on thread 0. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Print additional debug information in rendezvous.Mahesh Salgaonkar2018-04-171-2/+4
| | | | | | | Helps in debugging... Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Fix handling of TFMR parity/corrupt error.Mahesh Salgaonkar2018-04-171-5/+4
| | | | | | | | | | | | | | | | | | | While testing TFMR parity/corrupt error it has been observed that HMIs are delivered twice for this error:
- First time HMI is delivered with HMER[4,5]=1 and TFMR[60]=1.
- Second time HMI is delivered with HMER[4,5]=1 and TFMR[60]=0 with valid TB.

On second HMI we end up throwing below error message even though TB is in valid state.

"HMI: TB invalid without core error reported"

This patch fixes this issue by ignoring HMER[5] and checking only for TFMR[60] before setting this_cpu()->tb_invalid to true.

Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
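The register notation used in this message (HMER[5], TFMR[60]) follows IBM bit numbering, where bit 0 is the most significant bit of the 64-bit register. The following is a minimal sketch of the check the message describes, not the actual skiboot code: `ppc_bit()` and `tb_marked_invalid()` are hypothetical names chosen for illustration, and the HMER argument is kept only to show that it no longer influences the decision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IBM (big-endian) bit numbering: bit 0 is the most significant bit. */
static uint64_t ppc_bit(unsigned int bit)
{
	assert(bit < 64);
	return 1ULL << (63 - bit);
}

/*
 * Hypothetical helper: decide whether the timebase should be flagged
 * invalid. HMER is accepted only to make it visible that HMER[5] is
 * ignored; the decision depends on TFMR[60] alone.
 */
static bool tb_marked_invalid(uint64_t hmer, uint64_t tfmr)
{
	(void)hmer;
	return (tfmr & ppc_bit(60)) != 0;
}
```

With this model, the first HMI (TFMR[60]=1) marks the timebase invalid, while the second HMI (HMER[4,5] still set, TFMR[60]=0) does not.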
* opal/hmi: Stop flooding HMI event for TOD errors.Mahesh Salgaonkar2018-04-171-2/+5
| | | | | | | | | | | | | | Fix the issue where every thread on the chip sends an HMI event to the host for TOD errors. TOD errors are reported to all the cores/threads on the chip. Any one thread can fix the error and send the event. The rest of the threads don't need to send an HMI event. This patch fixes this by modifying the __chiptod_recover_tod_errors() function to return -1 if no errors are found. Without this change every thread that sees TFMR[51]=1 sends an HMI event to the host kernel. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Fix soft lockups during TOD errorsMahesh Salgaonkar2018-04-173-3/+28
| | | | | | | | | | | | | | | | | | | | | | | | | There are some TOD errors which do not affect the working of TOD and TB. They stay in valid state. Hence we don't need rendez-vous for TOD errors that do not affect TB working. TOD errors that affect TOD/TB will report a global error on TFMR[44] along with bit 51, and they will go in the rendez-vous path as expected. But the TOD errors that do not affect the TB register set only TFMR bit 51. The TFMR bit 51 is cleared when any single thread clears the TOD error. Once cleared, bit 51 is reflected to all the cores on that chip. Any thread that reads the TFMR register after the error is cleared will see TFMR bit 51 reset. Hence the threads that see TFMR[51]=1 fall through the rendez-vous path and the threads that see TFMR[51]=0 return doing nothing. This ends up in soft lockups in the host kernel. This patch fixes this issue by not considering the TOD interrupt (TFMR[51]) as a core-global error and hence avoiding the rendez-vous path completely. Instead, threads that see TFMR[51]=1 will now take a different path that just does the TOD error recovery. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Do not send HMI event if no errors are found.Mahesh Salgaonkar2018-04-173-10/+19
| | | | | | | | | | | | | For TOD errors, all the cores in the chip get HMIs. Any one thread from any core can fix the issue and TFMR will have error conditions cleared. The rest of the threads need not take any action if TOD errors are already cleared. Hence thread 0 of every core should get a fresh copy of TFMR before going ahead with the recovery path. Initialize recover = -1, so that if no errors are found that thread need not send an HMI event to Linux. This stops every thread from flooding the host with HMI events even when no errors are found. Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
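The "recover = -1" convention described here can be modelled in a few lines of C. This is a simplified sketch under stated assumptions, not the skiboot implementation: the chip-wide TFMR is modelled as a plain shared variable, and `EX_TFMR_TOD_INTERRUPT`, `recover_tod_errors()`, `should_send_hmi_event()` and `count_hmi_events()` are hypothetical names. Only the use of TFMR bit 51 for the TOD interrupt and the meaning of -1 ("nothing found, send no event") come from the commit messages.

```c
#include <stdbool.h>
#include <stdint.h>

/* TFMR bit 51 in IBM numbering (bit 0 is the MSB); the name is illustrative. */
#define EX_TFMR_TOD_INTERRUPT	(1ULL << (63 - 51))

/*
 * Returns 1 if this thread found and cleared a TOD error,
 * or -1 if there was nothing to recover.
 */
static int recover_tod_errors(uint64_t *chip_tfmr)
{
	int recover = -1;

	if (*chip_tfmr & EX_TFMR_TOD_INTERRUPT) {
		/* Model of "any one thread clears the error for the chip". */
		*chip_tfmr &= ~EX_TFMR_TOD_INTERRUPT;
		recover = 1;
	}
	return recover;
}

/* An HMI event is only worth sending if something was actually handled. */
static bool should_send_hmi_event(int recover)
{
	return recover != -1;
}

/* Run the recovery path on every thread and count the events generated. */
static int count_hmi_events(uint64_t *chip_tfmr, int nr_threads)
{
	int events = 0;

	for (int i = 0; i < nr_threads; i++)
		if (should_send_hmi_event(recover_tod_errors(chip_tfmr)))
			events++;
	return events;
}
```

In this model only the first thread to see the error produces an event; every later thread reads the already-cleared TFMR, gets -1 and stays quiet.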
* opal/hmi: Initialize the hmi event with old value of HMER.Mahesh Salgaonkar2018-04-171-3/+6
| | | | | | | | | | | | | | | | | | | | | | Do this before we check for TFAC errors. Otherwise the event at host console shows no error reported in HMER register.

Without this patch the console event show HMER with all zeros:

[ 216.753417] Severe Hypervisor Maintenance interrupt [Recovered]
[ 216.753498] Error detail: Timer facility experienced an error
[ 216.753509] HMER: 0000000000000000
[ 216.753518] TFMR: 3c12000870e04000

After this patch it shows old HMER values on host console:

[ 2237.652533] Severe Hypervisor Maintenance interrupt [Recovered]
[ 2237.652651] Error detail: Timer facility experienced an error
[ 2237.652766] HMER: 0840000000000000
[ 2237.652837] TFMR: 3c12000870e04000

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Rework HMI handling of TFAC errorsBenjamin Herrenschmidt2018-04-175-372/+280
| | | | | | | | | | | This patch reworks the HMI handling for TFAC errors by introducing 4 rendez-vous points to improve the thread synchronization while handling timebase errors that require all threads to clear dirty data from the TB/HDEC registers before clearing the errors. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Don't bother passing HMER to pre-recovery cleanupBenjamin Herrenschmidt2018-04-171-14/+6
| | | | | | | | The test for TFAC error is now redundant so we remove it and remove the HMER argument. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Move timer related error handling to a separate functionBenjamin Herrenschmidt2018-04-171-48/+58
| | | | | | | | Currently no functional change. This is a first step to completely rewriting how these things are handled. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Add a new opal_handle_hmi2 that returns direct info to LinuxBenjamin Herrenschmidt2018-04-173-46/+90
| | | | | | | | | It returns a 64-bit flags mask currently set to provide info about which timer facilities were lost, and whether an event was generated. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* opal/hmi: Remove races in clearing HMERBenjamin Herrenschmidt2018-04-171-10/+12
| | | | | | | | | | | | Writing to HMER acts as an "AND". The current code writes back the value we originally read with the bits we handled cleared. This is racy, if a new bit gets set in HW after the original read, we'll end up clearing it without handling it. Instead, use an all 1's mask with only the bit handled cleared. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
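The race described in this message can be reproduced with a small software model. The sketch below is illustrative only: `struct fake_hmer` and the three helpers are hypothetical names, and the register is modelled as a variable whose write operation ANDs the written value into the current contents, as the message states.

```c
#include <stdint.h>

/* Software model of a register where a write acts as an AND. */
struct fake_hmer {
	uint64_t value;
};

static void hmer_write(struct fake_hmer *h, uint64_t v)
{
	h->value &= v;
}

/*
 * Racy variant: writes back the earlier snapshot with the handled bits
 * cleared. Any bit set by hardware after the snapshot is 0 in the
 * written value, so the AND wipes it without it ever being handled.
 */
static void clear_handled_racy(struct fake_hmer *h, uint64_t snapshot,
			       uint64_t handled)
{
	hmer_write(h, snapshot & ~handled);
}

/*
 * Safe variant: all-ones mask with only the handled bits cleared, so
 * every other bit, including newly set ones, survives the AND.
 */
static void clear_handled_safe(struct fake_hmer *h, uint64_t handled)
{
	hmer_write(h, ~handled);
}
```

If a second error bit is raised between taking the snapshot and the write, the racy variant leaves the register at zero, while the safe variant leaves exactly the new bit set for a later handler to see.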
* opal/hmi: Don't re-read HMER multiple timesBenjamin Herrenschmidt2018-04-171-21/+14
| | | | | | | | | We want to make sure all reporting and actions are based upon the same snapshot of HMER in case bits get added by HW while we are in OPAL. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* libflash/blocklevel: Add missing newline to debug messagesPridhiviraj Paidipeddi2018-04-151-2/+2
| | | | | Signed-off-by: Pridhiviraj Paidipeddi <ppaidipe@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* phb4: Restore bus numbers after CRSMichael Neuling2018-04-113-12/+18
| | | | | | | | | | | Currently we restore PCIe bus numbers right after the link is up. Unfortunately at this point we haven't done CRS so config space may not be accessible. This moves the bus number restore till after CRS has happened. Signed-off-by: Michael Neuling <mikey@neuling.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* slots: Add power limit supportOliver O'Halloran2018-04-112-0/+6
| | | | | | | | Add support for sourcing power limit information from either the DT slot hierarchy or the slot table. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* phb4: Enable the PCIe slotcap on pluggable slotsOliver O'Halloran2018-04-112-0/+21
| | | | | | | | | | Enables reporting of slot status information, etc in the config space of the root complex. Currently this is only used to set the slot power limit in our generic PCI code, but we might use it for other things later on. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* core/pci: Set slot power limit when supportedOliver O'Halloran2018-04-113-0/+39
| | | | | | | | | | The PCIe slot capability can be implemented in a root or switch downstream port to set the maximum power a card is allowed to draw from the system. This patch adds support for setting the power limit when the platform has defined one. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
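For readers unfamiliar with the register involved: in the PCI Express Slot Capabilities register, the Slot Power Limit Value occupies bits 14:7 and the Slot Power Limit Scale occupies bits 16:15 (scale 0 means x1.0, 1 means x0.1, 2 means x0.01, 3 means x0.001). The following sketch shows how such a limit could be packed into a register value. The macro and function names are hypothetical and are not taken from the skiboot sources; the real patch may structure this differently.

```c
#include <assert.h>
#include <stdint.h>

/* Field layout of the PCIe Slot Capabilities register (names are illustrative). */
#define EX_SLOTCAP_SPL_VALUE_SHIFT	7
#define EX_SLOTCAP_SPL_VALUE_MASK	(0xffu << EX_SLOTCAP_SPL_VALUE_SHIFT)
#define EX_SLOTCAP_SPL_SCALE_SHIFT	15
#define EX_SLOTCAP_SPL_SCALE_MASK	(0x3u << EX_SLOTCAP_SPL_SCALE_SHIFT)

/*
 * Return slotcap with the power limit fields replaced by value/scale,
 * leaving every other bit of the register untouched.
 */
static uint32_t slotcap_set_power_limit(uint32_t slotcap, uint8_t value,
					uint8_t scale)
{
	assert(scale <= 3);

	slotcap &= ~(EX_SLOTCAP_SPL_VALUE_MASK | EX_SLOTCAP_SPL_SCALE_MASK);
	slotcap |= (uint32_t)value << EX_SLOTCAP_SPL_VALUE_SHIFT;
	slotcap |= (uint32_t)scale << EX_SLOTCAP_SPL_SCALE_SHIFT;
	return slotcap;
}
```

For example, a 75 W limit would be expressed as value 75 with scale 0, and a 12.0 W limit as value 120 with scale 1.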
* romulus: Add a barebones slot tableOliver O'Halloran2018-04-111-0/+30
| | | | | | | | Add slot tables for romulus. Hopefully they won't be needed in THE FUTURE! Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* astbmc: Add more slot table helpersOliver O'Halloran2018-04-111-0/+27
| | | | | | | | Add some helper macros for the common case of a slot, or builtin device directly under a PHB or switch port. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* interrupts: Create an "interrupts" property in the OPAL nodeBenjamin Herrenschmidt2018-04-1110-17/+39
| | | | | | | | | | | | Deprecate the old "opal-interrupts", it's still there, but the new property follows the standard and allow us to specify whether an interrupt is level or edge sensitive. Similarly create "interrupt-names" whose content is identical to "opal-interrupts-names". Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* Makefile: Make it easier to find the docsJoel Stanley2018-04-111-1/+4
| | | | | | | | | Add a top level 'doc' target that builds the html docs when the user types 'make doc'. Users who want other targets know that the docs live under docs/, so can go looking there. Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* hdata/spira: parse vpd to add part-number and serial-number to xscom@ nodeStewart Smith2018-04-113-1/+15
| | | | | | | | | Expected by FWTS and associates our processor with the part/serial number, which is obviously a good thing for one's own sanity. Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Reviewed-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* capi: Keep the current mmio windows in the mbt cache table.Christophe Lombard2018-04-111-32/+36
| | | | | | | | | | | | When the phb is used as a CAPI interface, the current mmio windows list is cleaned before adding the capi and the prefetchable memory (M64) windows, which implies that the non-prefetchable BAR is no longer configured. This patch sets only the mbt bar needed to pass the capi mmio window and keeps, as defined, the other mmio values (M32 and M64). Signed-off-by: Christophe Lombard <clombard@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* npu2-opencapi: Fix 'link internal error' FIR, take 2Frederic Barrat2018-04-111-4/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | When setting up an opencapi link, we set the transport muxes first, then set the PHY training config register, which includes disabling nvlink mode for the bricks. That's the order of the init sequence, as found in the NPU workbook. In reality, doing so works, but it raises 2 FIR bits in the PowerBus OLL FIR Register for the 2 links when we configure the transport muxes. Presumably because nvlink is not disabled yet and we are configuring the transport muxes for opencapi. bit 60: link0 internal error bit 61: link1 internal error Overall the current setup ends up being correct and everything works, but we raise 2 FIR bits. So tweak the order of operations to disable nvlink before configuring the transport muxes. Incidentally, this is what the scripts from the opencapi enablement team were doing all along. Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* npu2-opencapi: Fix 'link internal error' FIR, take 1Frederic Barrat2018-04-111-3/+17
| | | | | | | | | | | | | | | When we set up a link, we always enable ODL0 and ODL1 at the same time in the PHY training config register, even though we are setting up only one OTL/ODL, so it raises a "link internal error" FIR bit in the PowerBus OLL FIR Register for the second link. The error is harmless, as we'll eventually set up the second link, but there's no reason to raise that FIR bit. The fix is simply to only enable the ODL we are using for the link. Signed-off-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com> Acked-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* occ: sensors-groups: Add DT properties to mark HWMON sensor groupsShilpasri G Bhat2018-04-111-3/+14
| | | | | | | | | Fix the sensor type to match HWMON sensor types. Add compatible flag to indicate the environmental sensor groups so that operations on these groups can be handled by HWMON linux interface. Signed-off-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* hdata: Move 'HRMOR_BIT' macro to header fileVasant Hegde2018-04-113-5/+8
| | | | | | | | It's already defined twice, and soon I want to use it in a few other places. Let's move it to a header file. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* core: Correctly load initramfs in stb containerSamuel Mendoza-Jonas2018-04-112-5/+24
| | | | | | | | | | | | | | | Skiboot does not calculate the actual size and start location of the initramfs if it is wrapped by an STB container (for example if loading an initramfs from the ROOTFS partition). Check if the initramfs is in an STB container and determine the size and location correctly in the same manner as the kernel. Since load_initramfs() is called after load_kernel() move the call to trustedboot_exit_boot_services() into load_and_boot_kernel() so it is called after both of these. Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* Disable stop states from OPALStewart Smith2018-04-111-0/+23
| | | | | | | | | | | | | | | | | | | | | On ZZ, stop4,5,11 are enabled for PHYP, even though doing so may cause problems with OPAL due to bugs in hcode. For other platforms, this isn't so much of an issue as we can just control stop states by the MRW. However the rebuild-the-world approach to changing values there is a bit annoying if you just want to rule out a specific stop state from being problematic. Provide an nvram option to override what's disabled in OPAL. The OPAL mask is currently ~0xE0000000 (i.e. all but stop 0,1,2) You can set an NVRAM override with: nvram -p ibm,skiboot --update-config opal-stop-state-disable-mask=0xFFFFFFF This nvram override will disable *all* stop states. Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
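To make the mask arithmetic concrete, here is a small sketch of how such a disable mask could be interpreted. It assumes that stop level n corresponds to bit `0x80000000 >> n`, which is consistent with the statement that ~0xE0000000 disables everything except stop 0, 1 and 2, but the mapping and the names `EX_DEFAULT_DISABLE_MASK` and `stop_level_disabled()` are assumptions for illustration rather than the skiboot code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Default from the commit message: everything disabled except stop 0, 1, 2. */
#define EX_DEFAULT_DISABLE_MASK	(~0xE0000000u)

/*
 * Assumed mapping: stop level 0 is the most significant bit of the
 * 32-bit mask, stop level 1 the next one, and so on.
 */
static bool stop_level_disabled(uint32_t disable_mask, unsigned int level)
{
	assert(level < 32);
	return (disable_mask & (0x80000000u >> level)) != 0;
}
```

Under this assumption the default mask leaves stop 0, 1 and 2 available and disables stop 4, 5 and 11, while a mask with all 32 bits set disables every level.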
* opal-prd: Insert powernv_flash moduleVasant Hegde2018-04-101-1/+8
| | | | | | | | | | | | | | Explicitly load the powernv_flash module on BMC-based systems so that we are sure the flash device is created before starting the opal-prd daemon. Note that I have replaced the pnor_available() check with is_fsp_system(), as we want to load the module on BMC systems only. Also, pnor_init has enough logic to detect the flash device, so pnor_available() becomes a redundant check. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> CC: Jeremy Kerr <jeremy.kerr@au1.ibm.com> CC: Stewart Smith <stewart@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* hdat/i2c.c: quieten "v2 found, parsing as v1"Stewart Smith2018-04-101-2/+18
| | | | | Suggested-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
* npu2: Move NPU2_XTS_BDF_MAP_VALID assignment to context initReza Arbab2018-04-101-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | A bad GPU or other condition may leave us with a subset of links that never get initialized. If an ATSD is sent to one of those bricks, it will never complete, leaving us waiting forever for a response:

watchdog: BUG: soft lockup - CPU#23 stuck for 23s! [acos:2050]
...
Modules linked in: nvidia_uvm(O) nvidia(O)
CPU: 23 PID: 2050 Comm: acos Tainted: G W O 4.14.0 #2
task: c0000000285cfc00 task.stack: c000001fea860000
NIP: c0000000000abdf0 LR: c0000000000acc48 CTR: c0000000000ace60
REGS: c000001fea863550 TRAP: 0901 Tainted: G W O (4.14.0)
MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28004484 XER: 20040000
CFAR: c0000000000abdf4 SOFTE: 1
GPR00: c0000000000acc48 c000001fea8637d0 c0000000011f7c00 c000001fea863820
GPR04: 0000000002000000 0004100026000000 c0000000012778c8 c00000000127a560
GPR08: 0000000000000001 0000000000000080 c000201cc7cb7750 ffffffffffffffff
GPR12: 0000000000008000 c000000003167e80
NIP [c0000000000abdf0] mmio_invalidate_wait+0x90/0xc0
LR [c0000000000acc48] mmio_invalidate.isra.11+0x158/0x370

ATSDs are only sent to bricks which have a valid entry in the XTS_BDF table. So to prevent the hang, don't set NPU2_XTS_BDF_MAP_VALID unless we make it all the way to creating a context for the BDF.

Signed-off-by: Reza Arbab <arbab@linux.ibm.com>
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>