talos-op-linux - Talos™ II Linux sources for OpenPOWER

	Commit message (Collapse)	Author	Age	Files	Lines
*	powernv/pci: Move pnv_pci_dma_bus_setup() to pci-ioda.c	Oliver O'Halloran	2020-01-23	3	-22/+21
\| \| \| \| \| \| \| \| \|	This is only used in pci-ioda.c so move it there and rename it to match. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200110070207.439-6-oohall@gmail.com
*	powernv/pci: Fold pnv_pci_dma_dev_setup() into the pci-ioda.c version	Oliver O'Halloran	2020-01-23	3	-14/+4
\| \| \| \| \| \| \| \| \| \| \| \|	pnv_pci_dma_dev_setup() does nothing but call the phb->dma_dev_setup() callback, if one exists. That callback is only set for normal PCIe PHBs so we can remove the layer of indirection and use the ioda version in the pci_controller_ops. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200110070207.439-5-oohall@gmail.com
*	powerpc/iov: Move VF pdev fixup into pcibios_fixup_iov()	Oliver O'Halloran	2020-01-23	2	-18/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	An ioda_pe for each VF is allocated in pnv_pci_sriov_enable() before the pci_dev for the VF is created. We need to set the pe->pdev pointer at some point after the pci_dev is created. Currently we do that in: pcibios_bus_add_device() pnv_pci_dma_dev_setup() (via phb->ops.dma_dev_setup) /* fixup is done here */ pnv_pci_ioda_dma_dev_setup() (via pnv_phb->dma_dev_setup) The fixup needs to be done before setting up DMA for for the VF's PE, but there's no real reason to delay it until this point. Move the fixup into pnv_pci_ioda_fixup_iov() so the ordering is: pcibios_add_device() pnv_pci_ioda_fixup_iov() (via ppc_md.pcibios_fixup_sriov) pcibios_bus_add_device() ... This isn't strictly required, but it's a slightly more logical place to do the fixup and it simplifies pnv_pci_dma_dev_setup(). Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200110070207.439-4-oohall@gmail.com
*	powernv/pci: Remove dma_dev_setup() for NPU PHBs	Oliver O'Halloran	2020-01-23	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The pnv_pci_dma_dev_setup() only does something when: 1) There PHB contains VFs, or 2) The PHB defines a dma_dev_setup() callback in the pnv_phb structure. Neither is true for NPU PHBs so there's no reason to set the callback. Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20200110070207.439-3-oohall@gmail.com
*	powerpc/powernv: Allow manually invoking special reboots	Oliver O'Halloran	2020-01-23	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \|	OPAL provides several different kinds of reboot for the kernel to use, namely forcing a full reboot, platform error reboot and MPIPL. Right now triggering the alternative resets requires some ad-hoc method such as triggering a kernel crash and hoping the stars align. It's sometimes handy to be able to trigger one of these resets directly, so add a way to do that. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191101085522.3055-2-oohall@gmail.com
*	powerpc/powernv: Use common code for the symbol_map export	Oliver O'Halloran	2020-01-23	1	-38/+10
\| \| \| \| \| \| \| \| \| \|	Long before we had a generic way for firmware to export memory ranges of interest we added a special case for the skiboot symbol map. The code is pretty much identical to the generic export so re-use the code. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191101062611.32610-2-oohall@gmail.com
*	powerpc/powernv: Rework exports to support subnodes	Oliver O'Halloran	2020-01-23	1	-42/+72
\| \| \| \| \| \| \| \| \| \| \|	Originally we only had a handful of exported memory ranges, but we'd to export the per-core trace buffers. This results in a lot of files in the exports directory which is a but unfortunate. We can clean things up a bit by turning subnodes into subdirectories of the exports directory. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191101062611.32610-1-oohall@gmail.com
*	powernv/pci: Add a debugfs entry to dump PHB's IODA PE state	Oliver O'Halloran	2020-01-23	1	-0/+29
\| \| \| \| \| \| \| \| \| \| \| \| \|	Add a debugfs entry to dump the state of the active IODA PEs. The IODA PE state reflects how the PHB's internal concept of a PE is configured. This is separate to the EEH PE state and is managed power the PowerNV PCI backend rather than the EEH core. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> [mpe: Use DEFINE_DEBUGFS_ATTRIBUTE] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190912052945.12589-3-oohall@gmail.com
*	powernv/pci: Allow any write trigger the diag dump	Oliver O'Halloran	2020-01-23	1	-3/+0
\| \| \| \| \| \| \| \| \| \|	Make the dump trigger off any input rather than just '1'. This allows you to write "echo 1> dump_diag_data" and it'll do what you want rather than erroring out pointlessly. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190912052945.12589-2-oohall@gmail.com
*	powernv/pci: Use pnv_phb as the private data for debugfs entries	Oliver O'Halloran	2020-01-23	1	-9/+2
\| \| \| \| \| \| \| \| \| \| \|	Use the pnv_phb structure as the private data pointer for the debugfs files. This lets us delete some code and an open-coded use of hose->private_data. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190912052945.12589-1-oohall@gmail.com
*	powerpc/pcidn: Make VF pci_dn management CONFIG_PCI_IOV specific	Oliver O'Halloran	2020-01-23	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The powerpc PCI code requires that a pci_dn structure exists for all devices in the system. This is fine for real devices since at boot a pci_dn is created for each PCI device in the DT and it's fine for hotplugged devices since the hotplug slot driver will manage the pci_dn's devices in hotplug slots. For SR-IOV, we need the platform / pcibios to manage the pci_dn for virtual functions since firmware is unaware of VFs, and they aren't "hot plugged" in the traditional sense. Management of the pci_dn is handled by the, poorly named, functions: add_pci_dev_data() and remove_pci_dev_data(). The entire body of these functions is #ifdef`ed around CONFIG_PCI_IOV and they cannot be used in any other context, so make them only available when CONFIG_PCI_IOV is selected, and rename them to reflect their actual usage rather than having them masquerade as generic code. Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190821062655.19735-2-oohall@gmail.com
*	powerpc/powernv/ioda: Find opencapi slot for a device node	Frederic Barrat	2020-01-23	1	-10/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Unlike real PCI slots, opencapi slots are directly associated to the (virtual) opencapi PHB, there's no intermediate bridge. So when looking for a slot ID, we must start the search from the device node itself and not its parent. Also, the slot ID is not attached to a specific bdfn, so let's build it from the PHB ID, like skiboot. Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191121134918.7155-6-fbarrat@linux.ibm.com
*	powerpc/powernv/ioda: Release opencapi device	Frederic Barrat	2020-01-23	1	-19/+40
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	With hotplug, an opencapi device can now go away. It needs to be released, mostly to clean up its PE state. We were previously not defining any device callback. We can reuse the standard PCI release callback, it does a bit too much for an opencapi device, but it's harmless, and only needs minor tuning. Also separate the undo of the PELT-V code in a separate function, it is not needed for NPU devices and it improves a bit the readability of the code. Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191121134918.7155-5-fbarrat@linux.ibm.com
*	powerpc/powernv/ioda: set up PE on opencapi device when enabling	Frederic Barrat	2020-01-23	1	-8/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The PE for an opencapi device was set as part of a late PHB fixup operation, when creating the PHB. To use the PCI hotplug framework, this is not going to work, as the PHB stays the same, it's only the devices underneath which are updated. For regular PCI devices, it is done as part of the reconfiguration of the bridge, but for opencapi PHBs, we don't have an intermediate bridge. So let's define the PE when the device is enabled. PEs are meaningless for opencapi, the NPU doesn't define them and opal is not doing anything with them. Reviewed-by: Alastair D'Silva <alastair@d-silva.org> Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191121134918.7155-4-fbarrat@linux.ibm.com
*	powerpc/powernv/ioda: Protect PE list	Frederic Barrat	2020-01-23	1	-1/+5
\| \| \| \| \| \| \| \| \| \| \|	Protect the PHB's list of PE. Probably not needed as long as it was populated during PHB creation, but it feels right and will become required once we can add/remove opencapi devices on hotplug. Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191121134918.7155-3-fbarrat@linux.ibm.com
*	powerpc/powernv/ioda: Fix ref count for devices with their own PE	Frederic Barrat	2020-01-23	1	-7/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The pci_dn structure used to store a pointer to the struct pci_dev, so taking a reference on the device was required. However, the pci_dev pointer was later removed from the pci_dn structure, but the reference was kept for the npu device. See commit 902bdc57451c ("powerpc/powernv/idoa: Remove unnecessary pcidev from pci_dn"). We don't need to take a reference on the device when assigning the PE as the struct pnv_ioda_pe is cleaned up at the same time as the (physical) device is released. Doing so prevents the device from being released, which is a problem for opencapi devices, since we want to be able to remove them through PCI hotplug. Now the ugly part: nvlink npu devices are not meant to be released. Because of the above, we've always leaked a reference and simply removing it now is dangerous and would likely require more work. There's currently no release device callback for nvlink devices for example. So to be safe, this patch leaks a reference on the npu device, but only for nvlink and not opencapi. Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Reviewed-by: Andrew Donnellan <ajd@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191121134918.7155-2-fbarrat@linux.ibm.com
*	powerpc/powernv: use resource_size	Julia Lawall	2020-01-07	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use resource_size rather than a verbose computation on the end and start fields. The semantic patch that makes these changes is as follows: (http://coccinelle.lip6.fr/) <smpl> @@ struct resource ptr; @@ - (ptr.end - ptr.start + 1) + resource_size(&ptr) </smpl> Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1577900990-8588-11-git-send-email-Julia.Lawall@inria.fr
*	powerpc/powernv/iov: Ensure the pdn for VFs always contains a valid PE number	Oliver O'Halloran	2020-01-06	2	-8/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On pseries there is a bug with adding hotplugged devices to an IOMMU group. For a number of dumb reasons fixing that bug first requires re-working how VFs are configured on PowerNV. For background, on PowerNV we use the pcibios_sriov_enable() hook to do two things: 1. Create a pci_dn structure for each of the VFs, and 2. Configure the PHB's internal BARs so the MMIO range for each VF maps to a unique PE. Roughly speaking a PE is the hardware counterpart to a Linux IOMMU group since all the devices in a PE share the same IOMMU table. A PE also defines the set of devices that should be isolated in response to a PCI error (i.e. bad DMA, UR/CA, AER events, etc). When isolated all MMIO and DMA traffic to and from devicein the PE is blocked by the root complex until the PE is recovered by the OS. The requirement to block MMIO causes a giant headache because the P8 PHB generally uses a fixed mapping between MMIO addresses and PEs. As a result we need to delay configuring the IOMMU groups for device until after MMIO resources are assigned. For physical devices (i.e. non-VFs) the PE assignment is done in pcibios_setup_bridge() which is called immediately after the MMIO resources for downstream devices (and the bridge's windows) are assigned. For VFs the setup is more complicated because: a) pcibios_setup_bridge() is not called again when VFs are activated, and b) The pci_dev for VFs are created by generic code which runs after pcibios_sriov_enable() is called. The work around for this is a two step process: 1. A fixup in pcibios_add_device() is used to initialised the cached pe_number in pci_dn, then 2. A bus notifier then adds the device to the IOMMU group for the PE specified in pci_dn->pe_number. A side effect fixing the pseries bug mentioned in the first paragraph is moving the fixup out of pcibios_add_device() and into pcibios_bus_add_device(), which is called much later. This results in step 2. failing because pci_dn->pe_number won't be initialised when the bus notifier is run. We can fix this by removing the need for the fixup. The PE for a VF is known before the VF is even scanned so we can initialise pci_dn->pe_number pcibios_sriov_enable() instead. Unfortunately, moving the initialisation causes two problems: 1. We trip the WARN_ON() in the current fixup code, and 2. The EEH core clears pdn->pe_number when recovering a VF and relies on the fixup to correctly re-set it. The only justification for either of these is a comment in eeh_rmv_device() suggesting that pdn->pe_number must be set to IODA_INVALID_PE in order for the VF to be scanned. However, this comment appears to have no basis in reality. Both bugs can be fixed by just deleting the code. Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191028085424.12006-1-oohall@gmail.com
*	powerpc/perf: Disable trace_imc pmu	Madhavan Srinivasan	2019-12-05	1	-1/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When a root user or a user with CAP_SYS_ADMIN privilege uses any trace_imc performance monitoring unit events, to monitor application or KVM threads, it may result in a checkstop (System crash). The cause is frequent switching of the "trace/accumulation" mode of the In-Memory Collection hardware (LDBAR). This patch disables the trace_imc PMU unit entirely to avoid triggering the checkstop. A future patch will reenable it at a later stage once a workaround has been developed. Fixes: 012ae244845f ("powerpc/perf: Trace imc PMU functions") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Tested-by: Hariharan T.S. <hari@linux.ibm.com> [mpe: Add pr_info_once() so dmesg shows the PMU has been disabled] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191118034452.9939-1-maddy@linux.vnet.ibm.com
*	powerpc/powernv: Avoid re-registration of imc debugfs directory	Anju T Sudhakar	2019-12-05	1	-23/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	export_imc_mode_and_cmd() function which creates the debugfs interface for imc-mode and imc-command, is invoked when each nest pmu units is registered. When the first nest pmu unit is registered, export_imc_mode_and_cmd() creates 'imc' directory under `/debug/powerpc/`. In the subsequent invocations debugfs_create_dir() function returns, since the directory already exists. The recent commit <c33d442328f55> (debugfs: make error message a bit more verbose), throws a warning if we try to invoke `debugfs_create_dir()` with an already existing directory name. Address this warning by making the debugfs directory registration in the opal_imc_counters_probe() function, i.e invoke export_imc_mode_and_cmd() function from the probe function. Signed-off-by: Anju T Sudhakar <anju@linux.vnet.ibm.com> Tested-by: Nageswara R Sastry <nasastry@in.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191127072035.4283-1-anju@linux.vnet.ibm.com
*	powerpc/powernv: Disable native PCIe port management	Oliver O'Halloran	2019-11-21	1	-0/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On PowerNV the PCIe topology is (currently) managed by the powernv platform code in Linux in cooperation with the platform firmware. Linux's native PCIe port service drivers operate independently of both and this can cause problems. The main issue is that the portbus driver will conflict with the platform specific hotplug driver (pnv_php) over ownership of the MSI used to notify the host when a hotplug event occurs. The portbus driver claims this MSI on behalf of the individual port services because the same interrupt is used for hotplug events, PMEs (on root ports), and link bandwidth change notifications. The portbus driver will always claim the interrupt even if the individual port service drivers, such as pciehp, are compiled out. The second, bigger, problem is that the hotplug port service driver fundamentally does not work on PowerNV. The platform assumes that all PCI devices have a corresponding arch-specific handle derived from the DT node for the device (pci_dn) and without one the platform will not allow a PCI device to be enabled. This problem is largely due to historical baggage, but it can't be resolved without significant re-factoring of the platform PCI support. We can fix these problems in the interim by setting the "pcie_ports_disabled" flag during platform initialisation. The flag indicates the platform owns the PCIe ports which stops the portbus driver from being registered. This does have the side effect of disabling all port services drivers that is: AER, PME, BW notifications, hotplug, and DPC. However, this is not a huge disadvantage on PowerNV since these services are either unused or handled through other means. Fixes: 66725152fb9f ("PCI/hotplug: PowerPC PowerNV PCI hotplug driver") Signed-off-by: Oliver O'Halloran <oohall@gmail.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191118065553.30362-1-oohall@gmail.com
*	powerpc/powernv/ioda: using kfree_rcu() to simplify the code	YueHaibing	2019-11-13	1	-9/+1
\| \| \| \| \| \| \| \| \|	The callback function of call_rcu() just calls a kfree(), so we can use kfree_rcu() instead of call_rcu() + callback function. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190711141818.18044-1-yuehaibing@huawei.com
*	powerpc/powernv: Make some symbols static	YueHaibing	2019-11-13	3	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix sparse warnings: arch/powerpc/platforms/powernv/opal-psr.c:20:1: warning: symbol 'psr_mutex' was not declared. Should it be static? arch/powerpc/platforms/powernv/opal-psr.c:27:3: warning: symbol 'psr_attrs' was not declared. Should it be static? arch/powerpc/platforms/powernv/opal-powercap.c:20:1: warning: symbol 'powercap_mutex' was not declared. Should it be static? arch/powerpc/platforms/powernv/opal-sensor-groups.c:20:1: warning: symbol 'sg_mutex' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190702131733.44100-1-yuehaibing@huawei.com
*	powerpc/powernv/npu: Fix debugfs_simple_attr.cocci warnings	YueHaibing	2019-11-13	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use DEFINE_DEBUGFS_ATTRIBUTE rather than DEFINE_SIMPLE_ATTRIBUTE for debugfs files. Semantic patch information: Rationale: DEFINE_SIMPLE_ATTRIBUTE + debugfs_create_file() imposes some significant overhead as compared to DEFINE_DEBUGFS_ATTRIBUTE + debugfs_create_file_unsafe(). Generated by: scripts/coccinelle/api/debugfs/debugfs_simple_attr.cocci Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1545705876-63132-1-git-send-email-yuehaibing@huawei.com
*	Merge branch 'topic/secureboot' into next	Michael Ellerman	2019-11-13	4	-0/+147
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Merge the secureboot support, as well as the IMA changes needed to support it. From Nayna's cover letter: In order to verify the OS kernel on PowerNV systems, secure boot requires X.509 certificates trusted by the platform. These are stored in secure variables controlled by OPAL, called OPAL secure variables. In order to enable users to manage the keys, the secure variables need to be exposed to userspace. OPAL provides the runtime services for the kernel to be able to access the secure variables. This patchset defines the kernel interface for the OPAL APIs. These APIs are used by the hooks, which load these variables to the keyring and expose them to the userspace for reading/writing. Overall, this patchset adds the following support: * expose secure variables to the kernel via OPAL Runtime API interface * expose secure variables to the userspace via kernel sysfs interface * load kernel verification and revocation keys to .platform and .blacklist keyring respectively. The secure variables can be read/written using simple linux utilities cat/hexdump. For example: Path to the secure variables is: /sys/firmware/secvar/vars Each secure variable is listed as directory. $ ls -l total 0 drwxr-xr-x. 2 root root 0 Aug 20 21:20 db drwxr-xr-x. 2 root root 0 Aug 20 21:20 KEK drwxr-xr-x. 2 root root 0 Aug 20 21:20 PK The attributes of each of the secure variables are (for example: PK): $ ls -l total 0 -r--r--r--. 1 root root 4096 Oct 1 15:10 data -r--r--r--. 1 root root 65536 Oct 1 15:10 size --w-------. 1 root root 4096 Oct 1 15:12 update The "data" is used to read the existing variable value using hexdump. The data is stored in ESL format. The "update" is used to write a new value using cat. The update is to be submitted as AUTH file.
\| *	powerpc/powernv: Add OPAL API interface to access secure variable	Nayna Jain	2019-11-13	4	-0/+147
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The X.509 certificates trusted by the platform and required to secure boot the OS kernel are wrapped in secure variables, which are controlled by OPAL. This patch adds firmware/kernel interface to read and write OPAL secure variables based on the unique key. This support can be enabled using CONFIG_OPAL_SECVAR. Signed-off-by: Claudio Carvalho <cclaudio@linux.ibm.com> Signed-off-by: Nayna Jain <nayna@linux.ibm.com> Signed-off-by: Eric Richter <erichte@linux.ibm.com> [mpe: Make secvar_ops __ro_after_init, only build opal-secvar.c if PPC_SECURE_BOOT=y] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1573441836-3632-2-git-send-email-nayna@linux.ibm.com
* \|	Merge branch 'fixes' into next	Michael Ellerman	2019-11-04	2	-17/+38
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \|	Merge our fixes branch, primarily to bring in the powernv CPU hotplug warning fix.
\| * \|	powerpc/powernv: Fix CPU idle to be called with IRQs disabled	Nicholas Piggin	2019-10-29	1	-16/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit e78a7614f3876 ("idle: Prevent late-arriving interrupts from disrupting offline") changes arch_cpu_idle_dead to be called with interrupts disabled, which triggers the WARN in pnv_smp_cpu_kill_self. Fix this by fixing up irq_happened after hard disabling, rather than requiring there are no pending interrupts, similarly to what was done done until commit 2525db04d1cc5 ("powerpc/powernv: Simplify lazy IRQ handling in CPU offline"). Fixes: e78a7614f3876 ("idle: Prevent late-arriving interrupts from disrupting offline") Reported-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> [mpe: Add unexpected_mask rather than checking for known bad values, change the WARN_ON() to a WARN_ON_ONCE()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191022115814.22456-1-npiggin@gmail.com
\| * \|	powerpc/powernv/eeh: Fix oops when probing cxl devices	Frederic Barrat	2019-10-25	1	-1/+1
\| \|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Recent cleanup in the way EEH support is added to a device causes a kernel oops when the cxl driver probes a device and creates virtual devices discovered on the FPGA: BUG: Kernel NULL pointer dereference at 0x000000a0 Faulting instruction address: 0xc000000000048070 Oops: Kernel access of bad area, sig: 7 [#1] ... NIP eeh_add_device_late.part.9+0x50/0x1e0 LR eeh_add_device_late.part.9+0x3c/0x1e0 Call Trace: _dev_info+0x5c/0x6c (unreliable) pnv_pcibios_bus_add_device+0x60/0xb0 pcibios_bus_add_device+0x40/0x60 pci_bus_add_device+0x30/0x100 pci_bus_add_devices+0x64/0xd0 cxl_pci_vphb_add+0xe0/0x130 [cxl] cxl_probe+0x504/0x5b0 [cxl] local_pci_probe+0x6c/0x110 work_for_cpu_fn+0x38/0x60 The root cause is that those cxl virtual devices don't have a representation in the device tree and therefore no associated pci_dn structure. In eeh_add_device_late(), pdn is NULL, so edev is NULL and we oops. We never had explicit support for EEH for those virtual devices. Instead, EEH events are reported to the (real) pci device and handled by the cxl driver. Which can then forward to the virtual devices and handle dependencies. The fact that we try adding EEH support for the virtual devices is new and a side-effect of the recent cleanup. This patch fixes it by skipping adding EEH support on powernv for devices which don't have a pci_dn structure. The cxl driver doesn't create virtual devices on pseries so this patch doesn't fix it there intentionally. Fixes: b905f8cdca77 ("powerpc/eeh: EEH for pSeries hot plug") Signed-off-by: Frederic Barrat <fbarrat@linux.ibm.com> Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20191016162833.22509-1-fbarrat@linux.ibm.com
* /	powerpc/powernv: Add queue mechanism for early messages	Deb McLemore	2019-10-11	1	-2/+84
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When issuing a BMC soft poweroff during IPL, the poweroff can be lost so the machine would not poweroff. This is because opal messages can be received before the opal-power code registered its notifiers. Fix it by buffering messages. If we receive a message and do not yet have a handler for that type, store the message and replay when a handler for that type is registered. Signed-off-by: Deb McLemore <debmc@linux.vnet.ibm.com> [mpe: Single unlock path in opal_message_notifier_register(), tweak comments/formatting and change log.] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1526868278-4204-1-git-send-email-debmc@linux.vnet.ibm.com
*	KVM: PPC: Book3S HV: use smp_mb() when setting/clearing host_ipi flag	Michael Roth	2019-09-24	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On a 2-socket Power9 system with 32 cores/128 threads (SMT4) and 1TB of memory running the following guest configs: guest A: - 224GB of memory - 56 VCPUs (sockets=1,cores=28,threads=2), where: VCPUs 0-1 are pinned to CPUs 0-3, VCPUs 2-3 are pinned to CPUs 4-7, ... VCPUs 54-55 are pinned to CPUs 108-111 guest B: - 4GB of memory - 4 VCPUs (sockets=1,cores=4,threads=1) with the following workloads (with KSM and THP enabled in all): guest A: stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M guest B: stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M host: stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M the below soft-lockup traces were observed after an hour or so and persisted until the host was reset (this was found to be reliably reproducible for this configuration, for kernels 4.15, 4.18, 5.0, and 5.3-rc5): [ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU [ 1253.183319] rcu: 124-....: (5250 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=1941 [ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 52/KVM:19709] [ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! [worker:19913] [ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [worker:20331] [ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! [worker:20338] [ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! [avocado:19525] [ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791] [ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU [ 1316.198032] rcu: 124-....: (21003 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=8243 [ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791] [ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU [ 1379.212629] rcu: 124-....: (36756 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=14714 [ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! [ksmd:791] [ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU [ 1442.227115] rcu: 124-....: (52509 ticks this GP) idle=10a/1/0x4000000000000002 softirq=5408/5408 fqs=21403 [ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds. [ 1455.111822] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds. [ 1455.111905] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds. [ 1455.111986] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds. [ 1455.112068] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds. [ 1455.112159] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds. [ 1455.112231] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds. [ 1455.112303] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 [ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds. [ 1455.112392] Tainted: G L 5.3.0-rc5-mdr-vanilla+ #1 CPUs 45, 24, and 124 are stuck on spin locks, likely held by CPUs 105 and 31. CPUs 105 and 31 are stuck in smp_call_function_many(), waiting on target CPU 42. For instance: # CPU 105 registers (via xmon) R00 = c00000000020b20c R16 = 00007d1bcd800000 R01 = c00000363eaa7970 R17 = 0000000000000001 R02 = c0000000019b3a00 R18 = 000000000000006b R03 = 000000000000002a R19 = 00007d537d7aecf0 R04 = 000000000000002a R20 = 60000000000000e0 R05 = 000000000000002a R21 = 0801000000000080 R06 = c0002073fb0caa08 R22 = 0000000000000d60 R07 = c0000000019ddd78 R23 = 0000000000000001 R08 = 000000000000002a R24 = c00000000147a700 R09 = 0000000000000001 R25 = c0002073fb0ca908 R10 = c000008ffeb4e660 R26 = 0000000000000000 R11 = c0002073fb0ca900 R27 = c0000000019e2464 R12 = c000000000050790 R28 = c0000000000812b0 R13 = c000207fff623e00 R29 = c0002073fb0ca808 R14 = 00007d1bbee00000 R30 = c0002073fb0ca800 R15 = 00007d1bcd600000 R31 = 0000000000000800 pc = c00000000020b260 smp_call_function_many+0x3d0/0x460 cfar= c00000000020b270 smp_call_function_many+0x3e0/0x460 lr = c00000000020b20c smp_call_function_many+0x37c/0x460 msr = 900000010288b033 cr = 44024824 ctr = c000000000050790 xer = 0000000000000000 trap = 100 CPU 42 is running normally, doing VCPU work: # CPU 42 stack trace (via xmon) [link register ] c00800001be17188 kvmppc_book3s_radix_page_fault+0x90/0x2b0 [kvm_hv] [c000008ed3343820] c000008ed3343850 (unreliable) [c000008ed33438d0] c00800001be11b6c kvmppc_book3s_hv_page_fault+0x264/0xe30 [kvm_hv] [c000008ed33439d0] c00800001be0d7b4 kvmppc_vcpu_run_hv+0x8dc/0xb50 [kvm_hv] [c000008ed3343ae0] c00800001c10891c kvmppc_vcpu_run+0x34/0x48 [kvm] [c000008ed3343b00] c00800001c10475c kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm] [c000008ed3343b90] c00800001c0f5a78 kvm_vcpu_ioctl+0x470/0x7c8 [kvm] [c000008ed3343d00] c000000000475450 do_vfs_ioctl+0xe0/0xc70 [c000008ed3343db0] c0000000004760e4 ksys_ioctl+0x104/0x120 [c000008ed3343e00] c000000000476128 sys_ioctl+0x28/0x80 [c000008ed3343e20] c00000000000b388 system_call+0x5c/0x70 --- Exception: c00 (System Call) at 00007d545cfd7694 SP (7d53ff7edf50) is in userspace It was subsequently found that ipi_message[PPC_MSG_CALL_FUNCTION] was set for CPU 42 by at least 1 of the CPUs waiting in smp_call_function_many(), but somehow the corresponding call_single_queue entries were never processed by CPU 42, causing the callers to spin in csd_lock_wait() indefinitely. Nick Piggin suggested something similar to the following sequence as a possible explanation (interleaving of CALL_FUNCTION/RESCHEDULE IPI messages seems to be most common, but any mix of CALL_FUNCTION and !CALL_FUNCTION messages could trigger it): CPU X: smp_muxed_ipi_set_message(): X: smp_mb() X: message[RESCHEDULE] = 1 X: doorbell_global_ipi(42): X: kvmppc_set_host_ipi(42, 1) X: ppc_msgsnd_sync()/smp_mb() X: ppc_msgsnd() -> 42 42: doorbell_exception(): // from CPU X 42: ppc_msgsync() 105: smp_muxed_ipi_set_message(): 105: smb_mb() // STORE DEFERRED DUE TO RE-ORDERING --105: message[CALL_FUNCTION] = 1 \| 105: doorbell_global_ipi(42): \| 105: kvmppc_set_host_ipi(42, 1) \| 42: kvmppc_set_host_ipi(42, 0) \| 42: smp_ipi_demux_relaxed() \| 42: // returns to executing guest \| // RE-ORDERED STORE COMPLETES ->105: message[CALL_FUNCTION] = 1 105: ppc_msgsnd_sync()/smp_mb() 105: ppc_msgsnd() -> 42 42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored 105: // hangs waiting on 42 to process messages/call_single_queue This can be prevented with an smp_mb() at the beginning of kvmppc_set_host_ipi(), such that stores to message[<type>] (or other state indicated by the host_ipi flag) are ordered vs. the store to to host_ipi. However, doing so might still allow for the following scenario (not yet observed): CPU X: smp_muxed_ipi_set_message(): X: smp_mb() X: message[RESCHEDULE] = 1 X: doorbell_global_ipi(42): X: kvmppc_set_host_ipi(42, 1) X: ppc_msgsnd_sync()/smp_mb() X: ppc_msgsnd() -> 42 42: doorbell_exception(): // from CPU X 42: ppc_msgsync() // STORE DEFERRED DUE TO RE-ORDERING -- 42: kvmppc_set_host_ipi(42, 0) \| 42: smp_ipi_demux_relaxed() \| 105: smp_muxed_ipi_set_message(): \| 105: smb_mb() \| 105: message[CALL_FUNCTION] = 1 \| 105: doorbell_global_ipi(42): \| 105: kvmppc_set_host_ipi(42, 1) \| // RE-ORDERED STORE COMPLETES -> 42: kvmppc_set_host_ipi(42, 0) 42: // returns to executing guest 105: ppc_msgsnd_sync()/smp_mb() 105: ppc_msgsnd() -> 42 42: local_paca->kvm_hstate.host_ipi == 0 // IPI ignored 105: // hangs waiting on 42 to process messages/call_single_queue Fixing this scenario would require an smp_mb() after clearing host_ipi flag in kvmppc_set_host_ipi() to order the store vs. subsequent processing of IPI messages. To handle both cases, this patch splits kvmppc_set_host_ipi() into separate set/clear functions, where we execute smp_mb() prior to setting host_ipi flag, and after clearing host_ipi flag. These functions pair with each other to synchronize the sender and receiver sides. With that change in place the above workload ran for 20 hours without triggering any lock-ups. Fixes: 755563bc79c7 ("powerpc/powernv: Fixes for hypervisor doorbell handling") # v4.0 Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190911223155.16045-1-mdroth@linux.vnet.ibm.com
*	powerpc/fadump: support holes in kernel boot memory area	Hari Bathini	2019-09-14	2	-29/+30
\| \| \| \| \| \| \| \| \| \| \|	With support to copy multiple kernel boot memory regions owing to copy size limitation, also handle holes in the memory area to be preserved. Support as many as 128 kernel boot memory regions. This allows having an adequate FADump capture kernel size for different scenarios. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821385448.5656.6124791213910877759.stgit@hbathini.in.ibm.com
*	powerpc/fadump: consider f/w load area	Hari Bathini	2019-09-14	2	-1/+36
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	OPAL loads kernel & initrd at 512MB offset (256MB size), also exported as ibm,opal/dump/fw-load-area. So, if boot memory size of FADump is less than 768MB, kernel memory to be exported as '/proc/vmcore' would be overwritten by f/w while loading kernel & initrd. To avoid such a scenario, enforce a minimum boot memory size of 768MB on OPAL platform and skip using FADump if a newer F/W version loads kernel & initrd above 768MB. Also, irrespective of RMA size, set the minimum boot memory size expected on pseries platform at 320MB. This is to avoid inflating the minimum memory requirements on systems with 512M/1024M RMA size. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821381414.5656.1592867278535469652.stgit@hbathini.in.ibm.com
*	powerpc/opalcore: provide an option to invalidate /sys/firmware/opal/core file	Hari Bathini	2019-09-14	1	-0/+38
\| \| \| \| \| \| \| \| \|	Writing '1' to /sys/kernel/fadump_release_opalcore would release the memory held by kernel in exporting /sys/firmware/opal/core file. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821380161.5656.17827032108471421830.stgit@hbathini.in.ibm.com
*	powerpc/opalcore: export /sys/firmware/opal/core for analysing opal crashes	Hari Bathini	2019-09-14	4	-57/+678
\| \| \| \| \| \| \| \| \| \| \| \|	Export /sys/firmware/opal/core file to analyze opal crashes. Since OPAL core can be generated independent of CONFIG_FA_DUMP support in kernel, add this support under a new kernel config option CONFIG_OPAL_CORE. Also, avoid code duplication by moving common code used while exporting /proc/vmcore and/or /sys/firmware/opal/core file(s). Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821378503.5656.3693769384945087756.stgit@hbathini.in.ibm.com
*	powerpc/fadump: add support to preserve crash data on FADUMP disabled kernel	Hari Bathini	2019-09-14	2	-0/+74
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add a new kernel config option, CONFIG_PRESERVE_FA_DUMP that ensures that crash data, from previously crash'ed kernel, is preserved. This helps in cases where FADump is not enabled but the subsequent memory preserving kernel boot is likely to process this crash data. One typical usecase for this config option is petitboot kernel. As OPAL allows registering address with it in the first kernel and retrieving it after MPIPL, use it to store the top of boot memory. A kernel that intends to preserve crash data retrieves it and avoids using memory beyond this address. Move arch_reserved_kernel_pages() function as it is needed for both FA_DUMP and PRESERVE_FA_DUMP configurations. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821375751.5656.11459483669542541602.stgit@hbathini.in.ibm.com
*	powerpc/fadump: process architected register state data provided by firmware	Hari Bathini	2019-09-14	2	-7/+243
\| \| \| \| \| \| \| \| \| \| \| \|	Firmware provides architected register state data at the time of crash. Process this data and build CPU notes to append to ELF core. In case this data is missing or in unsupported format, at least append crashing CPU's register data, to have something to work with in the vmcore file. Signed-off-by: Hari Bathini <hbathini@linux.vnet.ibm.com> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821367702.5656.5546683836236508389.stgit@hbathini.in.ibm.com
*	powerpc/fadump: handle invalidation of crashdump and re-registraion	Hari Bathini	2019-09-14	1	-1/+11
\| \| \| \| \| \| \| \| \| \|	Make OPAL call to indicate that the dump is processed and the metadata area in OPAL can be cleared/released. Also, setup/initialize FADump for re-registration. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821356046.5656.12270927048195494911.stgit@hbathini.in.ibm.com
*	powerpc/fadump: Warn before processing partial crashdump	Hari Bathini	2019-09-14	1	-0/+24
\| \| \| \| \| \| \| \| \| \|	If all kernel boot memory regions are not registered for MPIPL before system crashes, try processing the partial crashdump but warn the user before proceeding. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821352793.5656.1734051341024721407.stgit@hbathini.in.ibm.com
*	powerpc/fadump: process the crashdump by exporting it as /proc/vmcore	Hari Bathini	2019-09-14	1	-2/+145
\| \| \| \| \| \| \| \| \| \|	Add support in the kernel to process the crash'ed kernel's memory preserved during MPIPL and export it as /proc/vmcore file for the userland scripts to filter and analyze it later. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821351482.5656.6255805804744333073.stgit@hbathini.in.ibm.com
*	powerpc/fadump: support copying multiple kernel boot memory regions	Hari Bathini	2019-09-14	1	-7/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Firmware uses a 32-bit field for size while copying/backing-up memory during MPIPL. So, the maximum value that could be represented with a PAGE_SIZE aligned 32-bit field will be the maximum copy size for a region but FADump capture kernel usually needs more memory than that to be preserved to avoid running into out of memory errors. So, request firmware to copy multiple kernel boot memory regions instead of just one (which worked fine for pseries as 64-bit field was used for size there). Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821350193.5656.3664853158523582019.stgit@hbathini.in.ibm.com
*	powerpc/fadump: define OPAL register/un-register callback functions	Hari Bathini	2019-09-14	1	-2/+77
\| \| \| \| \| \| \| \|	Make OPAL calls to register and un-register with firmware for MPIPL. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821348482.5656.13646250851483648241.stgit@hbathini.in.ibm.com
*	powerpc/fadump: reset metadata address during clean up	Hari Bathini	2019-09-14	1	-0/+10
\| \| \| \| \| \| \| \| \| \|	During kexec boot, metadata address needs to be reset to avoid running into errors interpreting stale metadata address, in case the kexec'ed kernel crashes before metadata address could be setup again. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821346629.5656.10783321582005237813.stgit@hbathini.in.ibm.com
*	powerpc/fadump: register kernel metadata address with opal	Hari Bathini	2019-09-14	2	-1/+122
\| \| \| \| \| \| \| \| \| \|	OPAL allows registering address with it in the first kernel and retrieving it after MPIPL. Setup kernel metadata and register its address with OPAL to use it for processing the crash dump. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821345011.5656.13567765019032928471.stgit@hbathini.in.ibm.com
*	powerpc/fadump: add fadump support on powernv	Hari Bathini	2019-09-14	2	-0/+93
\| \| \| \| \| \| \| \|	Add basic callback functions for FADump on PowerNV platform. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821342072.5656.4346362203141486452.stgit@hbathini.in.ibm.com
*	powerpc/opal: add MPIPL interface definitions	Hari Bathini	2019-09-14	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \|	MPIPL is Memory Preserving IPL supported from POWER9. This enables the kernel to reset the system with memory 'preserved'. Also, it supports copying memory from a source address to some destination address during MPIPL boot. Add MPIPL interface definitions here to leverage these f/w features in adding FADump support for PowerNV platform. Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821340710.5656.10071829040515662624.stgit@hbathini.in.ibm.com
*	powerpc/powernv: Fix build with IOMMU_API=n	Michael Ellerman	2019-09-14	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	The builds breaks when IOMMU_API=n, eg. skiroot_defconfig: arch/powerpc/platforms/powernv/npu-dma.c:96:28: error: 'get_gpu_pci_dev_and_pe' defined but not used arch/powerpc/platforms/powernv/npu-dma.c:126:13: error: 'pnv_npu_set_window' defined but not used Fixes: b4d37a7b6934 ("powerpc/powernv: Remove unused pnv_npu_try_dma_set_bypass() function") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
*	powerpc/xive: Fix bogus error code returned by OPAL	Greg Kurz	2019-09-12	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There's a bug in skiboot that causes the OPAL_XIVE_ALLOCATE_IRQ call to return the 32-bit value 0xffffffff when OPAL has run out of IRQs. Unfortunatelty, OPAL return values are signed 64-bit entities and errors are supposed to be negative. If that happens, the linux code confusingly treats 0xffffffff as a valid IRQ number and panics at some point. A fix was recently merged in skiboot: e97391ae2bb5 ("xive: fix return value of opal_xive_allocate_irq()") but we need a workaround anyway to support older skiboots already in the field. Internally convert 0xffffffff to OPAL_RESOURCE which is the usual error returned upon resource exhaustion. Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/156821713818.1985334.14123187368108582810.stgit@bahia.lan
*	powerpc/powernv: Add new opal message type	Vasant Hegde	2019-09-12	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \| \| \|	We have OPAL_MSG_PRD message type to pass prd related messages from OPAL to `opal-prd`. It can handle messages upto 64 bytes. We have a requirement to send bigger than 64 bytes of data from OPAL to `opal-prd`. Lets add new message type (OPAL_MSG_PRD2) to pass bigger data. Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> [mpe: Make the error string clear that it's the PRD2 event that failed] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190826065701.8853-2-hegdevasant@linux.vnet.ibm.com
*	powerpc/powernv: Enhance opal message read interface	Vasant Hegde	2019-09-12	1	-10/+21
\| \| \| \| \| \| \| \| \| \|	Use "opal-msg-size" device tree property to allocate memory for "opal_msg". Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com> [mpe: s/uint32_t/u32/ and mark opal_msg_size as __ro_after_init] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20190826065701.8853-1-hegdevasant@linux.vnet.ibm.com