author Stewart Smith <stewart@linux.ibm.com> 2018-07-11 18:26:16 +1000
committer Stewart Smith <stewart@linux.ibm.com> 2018-07-11 18:27:41 +1000
commit 452998f4be5973d8884e3db5aa362bf40e11467a
tree 789bf277c3df2dde1bdebdde02ec1d17fc89dff7 /doc/release-notes
parent 4949e4417d0eabf327f4c84cb87a619ce4af3730
skiboot 6.1 release notes
Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
Diffstat (limited to 'doc/release-notes')
-rw-r--r-- doc/release-notes/skiboot-6.1.rst 651
1 files changed, 651 insertions, 0 deletions
diff --git a/doc/release-notes/skiboot-6.1.rst b/doc/release-notes/skiboot-6.1.rst
new file mode 100644
index 00000000..45d87e4a
--- /dev/null
+++ b/doc/release-notes/skiboot-6.1.rst
@@ -0,0 +1,651 @@
+.. _skiboot-6.1:
+
+skiboot-6.1
+===========
+
+skiboot v6.1 was released on Wednesday July 11th 2018. It is the first
+release of skiboot 6.1, which is the new stable release of skiboot
+following the 6.0 release, first released May 11th 2018.
+
+Skiboot 6.1 is the basis for op-build v2.1 and contains all bug fixes as
+of :ref:`skiboot-6.0.5`, and :ref:`skiboot-5.4.9` (the currently maintained
+stable releases). We expect further stable releases in the 6.0.x and 5.4.x
+series, while we do not expect to do any stable releases of 6.1.x.
+
+This final 6.1 release follows a single release candidate release, as this
+cycle we have been rather quiet, with mainly cleanup and bug fix patches
+going in.
+
+For details on how the skiboot stable releases work, see :ref:`stable-rules`.
+
+Over skiboot-6.0, we have the following changes:
+
+General changes and bug fixes
+-----------------------------
+
+Since :ref:`skiboot-6.1-rc1`:
+
+- slw: Fix trivial typo in debug message
+- vpd: Add vendor property to processor node
+
+ Processor FRU vpd doesn't contain vendor detail. We have to parse
+ module VPD to get vendor detail.
+
+- vpd: Sanitize VPD data
+
+ On OpenPower systems, the VPD keyword size tells us the maximum size of the
+ data. However, the trailing end is padded with spaces (0x20) instead of NULs,
+ and the spec does not prevent users from having spaces (0x20) within the
+ actual data.
+
+ This patch discards trailing spaces before populating the device tree.
+- core: always flush console before stopping
+
+ This catches a few cases (e.g., fast reboot failure messages) that
+ don't always make it to the console before the machine is rebooted.
+- core/cpu: parallelise global CPU register setting jobs
+
+ On a 176 thread system, before: ::
+
+ [ 122.319923233,5] OPAL: Switch to big-endian OS
+ [ 126.317897467,5] OPAL: Switch to little-endian OS
+
+ after: ::
+
+ [ 212.439299889,5] OPAL: Switch to big-endian OS
+ [ 212.469323643,5] OPAL: Switch to little-endian OS
+- init, occ: Initialise OCC earlier on BMC systems
+
+ We need to use the OCC to obtain presence data for the SXM2 slots on
+ Witherspoon systems. This is needed to determine device type for NVLink
+ GPUs and OpenCAPI devices which can be plugged into the same slot. Support
+ for this will be implemented in a future patch.
+
+ Currently, OCC initialisation is done just before handing over to Linux,
+ which is well after NPU probe. On FSP systems, OCC boot starts very late,
+ so we wait until the last possible moment to initialise the skiboot side in
+ order to give it the maximum time to boot. On BMC systems, OCC boot starts
+ earlier, so there aren't any issues in moving it earlier in the skiboot
+ init sequence.
+
+ When running on a BMC machine, call occ_pstates_init() as early as
+ possible in the init sequence. On FSP machines, continue to call it from
+ its current location.
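
The "vpd: Sanitize VPD data" entry above describes dropping trailing space padding while leaving embedded spaces alone. A minimal Python sketch of that idea follows; the function name and the sample keyword value are hypothetical, and the real change lives in skiboot's C code.

```python
def sanitize_vpd_field(raw: bytes) -> str:
    """Drop trailing space (0x20) padding from a fixed-size VPD keyword.

    Hypothetical helper for illustration only, not skiboot's implementation.
    """
    # rstrip() only removes bytes from the tail, so spaces that are part of
    # the actual data (e.g. between words) are preserved.
    return raw.rstrip(b"\x20").decode("ascii")


# A made-up keyword value padded out to its maximum size with spaces.
vendor = sanitize_vpd_field(b"IBM Corp        ")
print(repr(vendor))
```

Only the tail is trimmed because, as the entry notes, the spec allows spaces inside the data itself.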
+
+Since :ref:`skiboot-6.0`:
+
+- GCC8 build fixes
+- Add prepare_hbrt_update to hbrt interfaces
+
+ Add placeholder support for prepare_hbrt_update call into
+ hostboot runtime (opal-prd) code. This interface is only
+ called as part of a concurrent code update on a FSP based
+ system.
+- cpu: Clear PCR SPR in opal_reinit_cpus()
+
+ Currently if Linux boots with a non-zero PCR, things can go bad where
+ some early userspace programs can take illegal instructions. This is
+ being fixed in Linux, but in the meantime, we should clean up in
+ skiboot also.
+- pci: Fix PCI_DEVICE_ID()
+
+ The vendor ID is 16 bits not 8. This error leaves the top of the vendor
+ ID in the bottom bits of the device ID, which resulted in e.g. a failure
+ to run the PCI quirk for the AST VGA device.
+- Quieten console output on boot
+
+ We print out a whole bunch of things on boot, most of which aren't
+ interesting, so we should *not* print them instead.
+
+ Printing things like what CPUs we found and what PCI devices we found
+ *are* useful, so continue to do that. But we don't need to splat out
+ a bunch of things that are always going to be true.
+- core/console: fix deadlock when printing with console lock held
+
+ Some debugging options will print while the console lock is held,
+ which is why the console lock is taken as a recursive lock.
+ However console_write calls __flush_console, which will drop and
+ re-take the lock non-recursively in some cases.
+
+ Just set con_need_flush and return from __flush_console if we are
+ holding the console lock already.
+
+ This stack usage message (taken with this patch applied) could lead
+ to a deadlock without this: ::
+
+ CPU 0000 lowest stack mark 11768 bytes left pc=300cb808 token=0
+ CPU 0000 Backtrace:
+ S: 0000000031c03370 R: 00000000300cb808 .list_check_node+0x1c
+ S: 0000000031c03410 R: 00000000300cb910 .list_check+0x38
+ S: 0000000031c034b0 R: 00000000300190ac .try_lock_caller+0xb8
+ S: 0000000031c03540 R: 00000000300192e0 .lock_caller+0x80
+ S: 0000000031c03600 R: 0000000030012c70 .__flush_console+0x134
+ S: 0000000031c036d0 R: 00000000300130cc .console_write+0x68
+ S: 0000000031c03780 R: 00000000300347bc .vprlog+0xc8
+ S: 0000000031c03970 R: 0000000030034844 ._prlog+0x50
+ S: 0000000031c03a00 R: 00000000300364a4 .log_simple_error+0x74
+ S: 0000000031c03b90 R: 000000003004ab48 .occ_pstates_init+0x184
+ S: 0000000031c03d50 R: 000000003001480c .load_and_boot_kernel+0x38c
+ S: 0000000031c03e30 R: 000000003001571c .main_cpu_entry+0x62c
+ S: 0000000031c03f00 R: 0000000030002700 boot_entry+0x1c0
+- opal-prd: Do not error out on first failure for soft/hard offline.
+
+ The memory errors (CEs and UEs) that are detected as part of background
+ memory scrubbing are reported by PRD asynchronously to opal-prd along with
+ affected memory ranges. hservice_memory_error() converts these ranges into
+ page granularity before hooking them up to the soft/hard offline-ing
+ infrastructure.
+
+ But the current implementation of hservice_memory_error() does not hook up
+ all the pages to soft/hard offline-ing if any of the page offline actions
+ fails. For example, hard offline can fail for:
+
+ - Pages that are not part of buddy managed pool.
+ - Pages that are reserved by kernel using memblock_reserved()
+ - Pages that are in use by kernel.
+
+ But for the pages that are in use by user space application, the hard
+ offline marks the page as hwpoison, sends SIGBUS signal to kill the
+ affected application as recovery action and returns success.
+
+ Hence, it is possible that some of the pages in that memory range are in
+ use by an application or free. By stopping on the first error we lose the
+ opportunity to hwpoison the subsequent pages, which may be free or in use
+ by an application. This patch fixes this issue.
+- libflash/blocklevel_write: Fix missing error handling
+
+ Caught by scan-build, we seem to trap the errors in rc, but
+ not take any recovery action during blocklevel_write.
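
The "pci: Fix PCI_DEVICE_ID()" entry above can be illustrated with a little bit arithmetic. This is a sketch in Python rather than the actual C macro. The `pci_device_id_buggy` helper is a hypothetical reconstruction of the error as the entry describes it (treating the vendor ID as 8 bits), and the AST VGA ID pair `1a03:2000` is an assumption used only as sample input.

```python
def pci_vendor_id(vdid: int) -> int:
    """Vendor ID: the low 16 bits of the combined vendor/device dword."""
    return vdid & 0xFFFF


def pci_device_id(vdid: int) -> int:
    """Device ID: the high 16 bits, so shift past all 16 vendor bits."""
    return (vdid >> 16) & 0xFFFF


def pci_device_id_buggy(vdid: int) -> int:
    """Hypothetical reconstruction of the bug: shifting by only 8 bits
    leaves the top byte of the vendor ID in the bottom of the result."""
    return (vdid >> 8) & 0xFFFF


# Assumed sample value: vendor 0x1a03, device 0x2000.
AST_VGA_VDID = 0x20001A03

print(hex(pci_vendor_id(AST_VGA_VDID)))
print(hex(pci_device_id(AST_VGA_VDID)))
print(hex(pci_device_id_buggy(AST_VGA_VDID)))
```

With the wrong shift, a quirk table lookup keyed on the device ID would never match, which is consistent with the AST VGA quirk failure the entry mentions.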
+
+I2C
+^^^
+- p8-i2c: fix wrong request status when a reset is needed
+
+ If the bus is found in error state when starting a new request, the
+ engine is reset and we enter recovery. However, once complete, the
+ reset operation shows a status of complete in the status register. So
+ any badly-timed call to check_status() will think the current top
+ request is complete, even though it hasn't run yet.
+
+ So don't update any request status while we are in recovery, as
+ nothing useful for the request is supposed to happen in that state.
+- p8-i2c: Remove force reset
+
+ Force reset was added as an attempt to work around some issues with TPM
+ devices locking up their I2C bus. In that particular case the problem
+ was that the device would hold the SCL line down permanently due to a
+ device firmware bug. The force reset doesn't actually do anything to
+ alleviate the situation here, it just happens to reset the internal
+ master state enough to make the I2C driver appear to work until
+ something tries to access the bus again.
+
+ On P9 systems with secure boot enabled there is the added problem
+ of the "diagnostic mode" not being supported on I2C masters A, B, C and
+ D. Diagnostic mode allows the SCL and SDA lines to be driven directly
+ by software. Without this, force reset is impossible to implement.
+
+ This patch removes the force reset functionality entirely since:
+
+ a) it doesn't do what it's supposed to, and
+ b) it's butt ugly code
+
+ Additionally, turn p8_i2c_reset_engine() into p8_i2c_reset_port().
+ There's no need to reset every port on a master in response to an
+ error that occurred on a specific port.
+- libstb/i2c-driver: Bump max timeout
+
+ We have observed some TPMs clock stretching the I2C bus for significant
+ amounts of time when processing commands. The same TPMs also have
+ errata that can result in permanently locking up a bus in response to
+ an I2C transaction they don't understand. Use an excessively long
+ timeout to prevent this in the field.
+- hdata: Add TPM timeout workaround
+
+ Set the default timeout for any bus containing a TPM to one second. This
+ is needed to work around a bug in the firmware of certain TPMs that will
+ clock stretch the I2C port for up to a second. Additionally, when the
+ TPM is clock stretching it responds to a STOP condition on the bus by
+ bricking itself. Clearing this error requires a hard power cycle of the
+ system since the TPM is powered by standby power.
+- p8-i2c: Allow a per-port default timeout
+
+ Add support for setting a default timeout for the I2C port to the
+ device-tree. This is consumed by skiboot.
+
+IPMI Watchdog
+^^^^^^^^^^^^^
+- ipmi-watchdog: Support handling re-initialization
+
+ Watchdog resets can return an error code from the BMC indicating that
+ the BMC watchdog was not initialized. Currently we abort skiboot due to
+ a missing error handler. This patch implements handling
+ re-initialization for the watchdog, automatically saving the last
+ watchdog set values and re-issuing them if needed.
+- ipmi-watchdog: The stop action should disable reset
+
+ Otherwise it is possible for the reset timer to elapse and trigger the
+ watchdog to wake back up. This doesn't affect the behavior of the
+ system since we are providing a NONE action to the BMC. However we would
+ like to avoid the action from taking place if possible.
+- ipmi-watchdog: Add a flag to determine if we are still ticking
+
+ This makes it easier for future changes to ensure that the watchdog
+ stops ticking and doesn't requeue itself for execution in the
+ background. This way it is safe for resets to be performed after the
+ ticks are assumed to be stopped and it won't start the timer again.
+- ipmi-watchdog: (prepare for) not disabling at shutdown
+
+ The op-build linux kernel has been configured to support the ipmi
+ watchdog. This driver will always handle the watchdog by either leaving
+ it enabled if configured, or by disabling it during module load if no
+ configuration is provided. This increases the coverage of the watchdog
+ during the boot process. The watchdog should no longer be disabled at
+ any point during skiboot execution.
+
+ We're not enabling this by default yet as people can (and do, at least in
+ development) mix and match old BOOTKERNEL with new skiboot and we don't
+ want to break that too obviously.
+- ipmi-watchdog: Don't reset the watchdog twice
+
+ There is no clarification for why this change was needed, but presumably
+ this is due to a buggy BMC implementation where the Watchdog Set command
+ was processed concurrently or after the initial Watchdog Reset. This
+ inversion would cause the watchdog to stop since the DONT_STOP bit was
+ not set. Since we are now using the DONT_STOP bit during initialization,
+ the watchdog should not be stopped even if an inversion occurs.
+- ipmi-watchdog: Make it possible to set DONT_STOP
+
+ The IPMI standard supports setting a DONT_STOP bit during a Watchdog
+ Set operation. Most of the time we don't want to stop the Watchdog when
+ updating the settings so we should be using this bit. This patch makes
+ it possible for callers of set_wdt to prevent the watchdog from being
+ stopped. This only changes the behavior of the watchdog during the
+ initial settings update when initializing skiboot. The watchdog is no
+ longer disabled and then immediately re-enabled.
+- ipmi-watchdog: WD_POWER_CYCLE_ACTION -> WD_RESET_ACTION
+
+ The IPMI specification denotes that action 0x1 is Host Reset and 0x3 is
+ Host Power Cycle. Use the correct name for Reset in our watchdog code.
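
To make the DONT_STOP and action-code entries above concrete, here is a hedged Python sketch of how the six data bytes of an IPMI "Set Watchdog Timer" request are laid out. The constant and function names are hypothetical, not skiboot's. The field layout (timer use byte with bit 6 as "don't stop", action codes 0x1 = reset and 0x3 = power cycle, countdown in 100 ms units, LSB first) follows the IPMI specification.

```python
# Hypothetical names for illustration; values follow the IPMI specification.
TIMER_USE_POST = 0x02        # timer use field, bits 2:0
TIMER_USE_DONT_STOP = 0x40   # bit 6: do not stop the timer on Set Watchdog

WD_ACTION_NONE = 0x0
WD_RESET_ACTION = 0x1        # Host Reset
WD_POWER_DOWN_ACTION = 0x2
WD_POWER_CYCLE_ACTION = 0x3  # Host Power Cycle


def set_wdt_request(action: int, countdown_s: int, dont_stop: bool,
                    timer_use: int = TIMER_USE_POST) -> bytes:
    """Build the 6 data bytes of a Set Watchdog Timer request."""
    ticks = countdown_s * 10  # initial countdown is in 100 ms units
    if not 0 <= ticks <= 0xFFFF:
        raise ValueError("countdown does not fit in 16 bits")
    use = timer_use & 0x07
    if dont_stop:
        use |= TIMER_USE_DONT_STOP
    return bytes([
        use,                  # timer use (+ DONT_STOP flag)
        action & 0x07,        # timeout action, no pre-timeout interrupt
        0x00,                 # pre-timeout interval (unused here)
        0x00,                 # timer use expiration flags to clear: none
        ticks & 0xFF,         # initial countdown, LSB
        (ticks >> 8) & 0xFF,  # initial countdown, MSB
    ])


print(set_wdt_request(WD_RESET_ACTION, 30, dont_stop=True).hex())
```

Setting `dont_stop=True` corresponds to updating the settings without halting a running timer, which is the behaviour the DONT_STOP entry describes for the initial settings update.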
+
+
+POWER8 platforms
+----------------
+
+- astbmc: Enable mbox depending on scratch reg
+
+ P8 boxes can opt in for mbox pnor support if they set the scratch
+ register bit to indicate it is supported.
+
+Simulator platforms
+-------------------
+
+Since :ref:`skiboot-6.1-rc1`:
+
+- pmem: volatile bindings for the poorly enabled
+
+ PMEM_DISK bindings were added, but they rely on a rather
+ recent mmap feature. This patch steals from those bindings
+ to add volatile bindings. I've used these bindings with
+ PMEM_VOLATILE to launch an instance with the publicly
+ available systemsim-p9. The bindings are volatile and one
+ should not expect any data to be saved/retrieved.
+
+Since :ref:`skiboot-6.0`:
+
+- plat/qemu: add PNOR support
+
+ To access the PNOR, OPAL/skiboot drives the BMC SPI controller using
+ the iLPC2AHB device of the BMC SuperIO controller and accesses the
+ flash contents using the LPC FW address space on which the PNOR is
+ remapped.
+
+ The QEMU PowerNV machine now integrates such models (SuperIO
+ controller, iLPC2AHB device) and also a pseudo Aspeed SoC AHB memory
+ space populated with the SPI controller registers (same model as for
+ ARM). The AHB window giving access to the contents of the BMC SPI
+ controller flash modules is mapped on the LPC FW address space.
+
+ The change should be compatible with machines without PNOR support.
+- external/mambo: Add support for readline if it exists
+
+ Add support for tclreadline package if it is present.
+ This patch loads the package and uses it when the
+ simulation stops for any reason.
+
+
+FSP based platforms
+-------------------
+
+- Disable fast reboot on FSP IPL side change
+
+ If FSP changes next IPL side, then disable fast reboot.
+
+ sample output: ::
+
+ [ 620.196442259,5] FSP: Got sysparam update, param ID 0xf0000007
+ [ 620.196444501,5] CUPD: FW IPL side changed. Disable fast reboot
+ [ 620.196445389,5] CUPD: Next IPL side : perm
+- fsp/console: Always establish OPAL console API backend
+
+ Currently we only call set_opal_console() to establish the backend
+ used by the OPAL console API if we find at least one FSP serial
+ port in HDAT.
+
+ On systems where there is none (IPMI only), we fail to set it,
+ causing the console code to try to use the dummy console causing
+ an assertion failure during boot due to clashing on the device-tree
+ node names.
+
+ So always set it if an FSP is present.
+
+AST BMC based platforms
+-----------------------
+
+- AMI BMC: use 0x3a as OEM command
+
+ The 0x3a OEM command is for IBM commands, while 0x32 was for AMI ones.
+ Sometime in the P8 timeframe, AMI BMCs were changed to listen for our
+ commands on either 0x32 or 0x3a. Since 0x3a is the direction forward,
+ we'll use that, as P9 machines with AMI BMCs probably also want these
+ to work, and let's not bet that 0x32 will continue to be okay.
+- astbmc: Set romulus BMC type to OpenBMC
+- platform/astbmc: Do not delete compatible property
+
+ From P9 onwards, OPAL builds the device tree for BMC based systems using
+ HDAT. We populate the bmc/compatible node with the BMC version. Hence,
+ do not delete this property.
+
+Utilities
+---------
+- external/xscom-utils: Add python library for xscom access
+
+ Patch adds a simple python library module for xscom access.
+ It directly manipulates the '/access' file for scom read
+ and write from the debugfs 'scom' directory.
+
+ Example on how to generate a getscom using this module:
+
+ .. code-block:: python
+
+ from adu_scoms import *
+ getscom = GetSCom()
+ getscom.parse_args()
+ getscom.run_command()
+
+ Sample output for above getscom.py:
+
+ .. code-block:: console
+
+ # ./getscom.py -l
+ Chip ID | Rev | Chip type
+ ---------|-------|-----------
+ 00000008 | DD2.0 | P9 (Nimbus) processor
+ 00000000 | DD2.0 | P9 (Nimbus) processor
+- ffspart: Don't require user to create blank partitions manually
+
+ Add '--allow-empty' which allows the filename for a given partition to
+ be blank. If set ffspart will set that part of the PNOR file 'blank' and
+ set ECC bits if required.
+ Without this option behaviour is unchanged and ffspart will return an
+ error if it can not find the partition file.
+- pflash: Use correct prefix when installing
+
+ pflash uses lowercase prefix when running make install in its
+ directory, but uppercase PREFIX when running it in shared. Use
+ lowercase everywhere.
+
+ With this the OpenBMC bitbake recipe can drop an out of tree patch it's
+ been carrying for years.
+
+
+POWER9
+------
+
+Since :ref:`skiboot-6.1-rc1`:
+
+- occ: sensors: Fix the size of the phandle array 'sensors' in DT
+
+ Fixes: 99505c03f493 (present in v5.10-rc4)
+- phb4: Delay training till after PERST is deasserted
+
+ This helps some cards train on the second PERST (i.e. fast-reboot). It
+ is not clear why, but it helps, so YOLO!
+
+Since :ref:`skiboot-6.0`:
+
+- occ-sensor: Avoid using uninitialised struct cpu_thread
+
+ When adding the sensors in occ_sensors_init, if the type is not
+ OCC_SENSOR_LOC_CORE, then the loop to find 'c' will not be executed.
+ Then c->pir is used for both of the add_sensor_node calls below.
+
+ This provides a default value of 0 instead.
+- NX: Add NX coprocessor init opal call
+
+ The read offset (4:11) in the Receive FIFO control register is incremented
+ by the FIFO size whenever a CRB is read by NX. But the index in RxFIFO has
+ to match the corresponding entry in the FIFO maintained by VAS in the
+ kernel. The VAS entry is reset to 0 when opening the receive window during
+ driver initialization. So when NX842 is reloaded or on kexec boot, there is
+ a possibility of mismatch between the RxFIFO control register and the VAS
+ entries in the kernel. It could cause CRB failure / timeout from NX.
+
+ This patch adds nx_coproc_init opal call for kernel to initialize
+ readOffset (4:11) and Queued (15:23) in RxFIFO control register.
+- SLW: Remove stop1_lite and stop2_lite
+
+ stop1_lite has been removed since it adds no additional benefit
+ over stop0_lite. stop2_lite has been removed since currently it adds
+ minimal benefit over stop2. However, the benefit is eclipsed by the time
+ required to ungate the clocks.
+
+ Moreover, Lite states don't give up the SMT resources, so they can
+ potentially have a performance impact on sibling threads.
+
+ Since current OSs (Linux) aren't smart enough to make good decisions
+ with these stop states, we're (temporarily) removing them from what
+ we expose to the OS, the idea being to bring them back in a new
+ DT representation so that only an OS that knows what to do will
+ do things with them.
+- cpu: Use STOP1 on POWER9 for idle/sleep inside OPAL
+
+ The current code requests STOP3, which means it gets STOP2 in practice.
+
+ STOP2 has proven to occasionally be unreliable depending on FW
+ version and chip revision, it also requires a functional CME,
+ so instead, let's use STOP1. The difference is rather minimal
+ for something that is only used a few seconds during boot.
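
The "NX: Add NX coprocessor init opal call" entry above names two fields of the RxFIFO control register by IBM bit numbers, readOffset (4:11) and Queued (15:23). The Python sketch below shows how those ranges map to masks in a 64-bit register where bit 0 is the most significant bit. It assumes that "initialize" means clearing both fields to 0, which the entry implies but does not state outright, and the helper names are illustrative rather than skiboot's.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


def ppc_bit(bit: int) -> int:
    """IBM bit numbering: bit 0 is the most significant bit of 64."""
    return 1 << (63 - bit)


def ppc_bitmask(start: int, end: int) -> int:
    """Mask covering IBM bits start..end inclusive (start <= end)."""
    return (ppc_bit(start) - ppc_bit(end)) | ppc_bit(start)


RX_FIFO_READ_OFFSET = ppc_bitmask(4, 11)   # readOffset (4:11)
RX_FIFO_QUEUED = ppc_bitmask(15, 23)       # Queued (15:23)


def reset_rxfifo_ctrl(value: int) -> int:
    """Assumed behaviour: clear readOffset and Queued, keep other bits."""
    return value & ~(RX_FIFO_READ_OFFSET | RX_FIFO_QUEUED) & MASK64


print(hex(RX_FIFO_READ_OFFSET))
print(hex(RX_FIFO_QUEUED))
```

Clearing both fields would bring the hardware's view of the FIFO back in line with a VAS entry that has just been reset to 0, which is the mismatch the entry sets out to avoid.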
+
+NPU2 (NVLink2 and OpenCAPI)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Since :ref:`skiboot-6.1-rc1`:
+
+- capi: Select the correct IODA table entry for the mbt cache.
+
+ With the current code, the capi mmio window is not correctly configured
+ in the IODA table entry. The first entry (generally the non-prefetchable
+ BAR) is overwritten.
+ This patch sets the capi window bar at the right place.
+- npu2/hw-procedures: Fence bricks via NTL instead of MISC
+
+ There are a couple of places we can set/unset fence for a brick:
+
+ 1. MISC register: NPU2_MISC_FENCE_STATE
+ 2. NTL register for the brick: NPU2_NTL_MISC_CFG1(ndev)
+
+ Recent testing of ATS in combination with GPU reset has exposed a side
+ effect of using (1); if fence is set for all six bricks, it triggers a
+ sticky nmmu latch which prevents the NPU from getting ATR responses.
+ This manifests as a hang in the tests.
+
+ We have npu2_dev_fence_brick() which uses (1), and only two calls to it.
+ Replace the call which sets fence with a write to (2). Remove the
+ corresponding unset call entirely. It's unneeded because the procedures
+ already do a progression from full fence to half to idle using (2).
+
+- phb4/capp: Calculate STQ/DMA read engines based on link-width for PEC
+
+ Presently in CAPI mode the number of STQ/DMA-read engines allocated on
+ PEC2 for CAPP is fixed to 6 and 0-30 respectively irrespective of the
+ PCI link width. These values are only suitable for x8 cards and
+ quickly run out if a x16 card is plugged to a PEC2 attached slot. This
+ usually manifests as CAPP reporting TLBI timeout due to these messages
+ getting stalled due to insufficient STQs.
+
+ To fix this we update enable_capi_mode() to check if PEC2 chiplet is
+ in x16 mode and if yes then we allocate 4/0-47 STQ/DMA-read engines
+ for the CAPP traffic.
+
+ Fixes: 37ea3cfdc852 (present in v5.7-rc1)
+- npu2: Use same compatible string for NVLink and OpenCAPI link nodes in device tree
+
+ Currently, we distinguish between NPU links for NVLink devices and OpenCAPI
+ devices through the use of two different compatible strings - ibm,npu-link
+ and ibm,npu-link-opencapi.
+
+ As we move towards supporting configurations with both NVLink and OpenCAPI
+ devices behind a single NPU, we need to detect the device type as part of
+ presence detection, which can't happen until well after the point where the
+ HDAT or platform code has created the NPU device tree nodes. Changing a
+ node's compatible string after it's been created is a bit ugly, so instead
+ we should move the device type to a new property which we can add to the
+ node later on.
+
+ Get rid of the ibm,npu-link-opencapi compatible string, add a new
+ ibm,npu-link-type property, and a helper function to check the link type.
+ Add an "unknown" device type in preparation for later patches to detect
+ device type dynamically.
+
+ These device tree bindings are entirely internal to skiboot and are not
+ consumed directly by Linux, so this shouldn't break anything (other than
+ internal BML lab environments).
+- occ: Add support for GPU presence detection
+
+ On the Witherspoon platform, we need to distinguish between NVLink GPUs and
+ OpenCAPI accelerators. In order to do this, we first need to find out
+ whether the SXM2 socket is populated.
+
+ On Witherspoon, the SXM2 socket's presence detection pin is only visible
+ via I2C from the APSS, and thus can only be exposed to the host via the
+ OCC. The OCC, per OCC Firmware Interface Specification for POWER9 version
+ 0.22, now exposes this to skiboot through a field in the dynamic data
+ shared memory.
+
+ Add the necessary dynamic data changes required to read the version and
+ GPU presence fields. Add a function, occ_get_gpu_presence(), that can be
+ used to check GPU presence.
+
+ If the OCC isn't reporting presence (old OCC firmware, or some other
+ reason), we default to assuming there is a device present and wait for
+ link training to fail.
+
+ This will be used in later patches to fix up the NPU2 probe path for
+ OpenCAPI support on Witherspoon.
+- hw/npu2, core/hmi: Use NPU instead of NPU2 as log message prefix
+
+ The NPU2{DBG,INF,ERR} macros use "NPU%d" as a prefix to identify messages
+ relating to a particular NPU.
+
+ It's slightly confusing to have per-NPU messages prefixed with "NPU0" or
+ "NPU1" and NPU-generic messages prefixed with "NPU2". On some future system
+ we could potentially have a NPU #2 in which case it'd be really confusing.
+
+ Use NPU rather than NPU2 for NPU-generic log messages. There's no risk of
+ confusion with the original npu.c code since that's only for P8.
+
+Since :ref:`skiboot-6.0`:
+
+- npu2: Reset NVLinks on hot reset
+
+ This effectively fences GPU RAM on GPU reset so the host system
+ does not have to crash every time we stop a KVM guest with a GPU
+ passed through.
+- npu2-opencapi: reduce number of retries to train the link
+
+ We've been reliably training the opencapi link on the first attempt
+ for quite a while. Furthermore, if it doesn't train on the first
+ attempt, retries haven't been that useful. So let's reduce the number
+ of attempts we do to train the link.
+
+ 2 retries = 3 attempts to train.
+
+ Each (failed) training sequence costs about 3 seconds.
+- opal/hmi: Display correct chip id while printing NPU FIRs.
+
+ HMIs for NPU xstops are broadcast to all chips. All cores on all the
+ chips receive the HMI. The HMI handler correctly identifies and extracts the
+ NPU FIR details from affected chip, but while printing FIR data it
+ prints chip id and location code details of this_cpu()->chip_id which
+ may not be correct. This patch fixes this issue.
+- npu2-opencapi: Fix link state to report link down
+
+ The PHB callback 'get_link_state' is always reporting the link width,
+ irrespective of the link status and even when the link is down. It is
+ causing too much work (and failures) when the PHB is probed during pci
+ init.
+ The fix is to look at the link status first and report the link as
+ down when appropriate.
+- npu2-opencapi: Cleanup traces printed during link training
+
+ Now that links may train in parallel, traces shown during training can
+ be all mixed up. So add a prefix to all the traces to clearly identify
+ the chip and link the trace refers to: ::
+
+ OCAPI[<chip id>:<link id>]: this is a very useful message
+
+ The lower-level hardware procedures (npu2-hw-procedures.c) also print
+ traces which would need work. But that code is being reworked to be
+ better integrated with opencapi and nvidia, so leave it alone for now.
+- npu2-opencapi: Train links on fundamental reset
+
+ Reorder our link training steps so that they are executed on
+ fundamental reset instead of during the initial setup. Skiboot always
+ calls a fundamental reset on all the PHBs during pci init.
+
+ It is done through a state machine, similarly to what is done for
+ 'real' PHBs.
+
+ This is the first step for a longer term goal to be able to trigger an
+ adapter reset from linux. We'll need the reset callbacks of the PHB to
+ be defined. We have to handle the various delays differently, since a
+ linux thread shouldn't stay stuck waiting in opal for too long.
+- npu2-opencapi: Rework adapter reset
+
+ Rework the code that resets the opencapi adapter a bit:
+
+ - make clearer which i2c pin is resetting which device
+ - break the reset operation into smaller chunks. This is really to
+ prepare for a future patch.
+
+ No functional changes.
+- npu2-opencapi: Use presence detection
+
+ Presence detection is not part of the opencapi specification. So each
+ platform may choose to implement it the way it wants.
+
+ All current platforms implement it through an i2c device where we can
+ query a pin to know if a device is connected or not. ZZ and Zaius have
+ a similar design and even use the same i2c information and pin
+ numbers.
+ However, presence detection on older ZZ planar (older than v4) doesn't
+ work, so we don't activate it for now, until our lab systems are
+ upgraded and it's better tested.
+
+ Presence detection on witherspoon is still being worked on. It's
+ shaping up to be quite different, so we may have to revisit the topic
+ in a later patch.
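
The "npu2-opencapi: reduce number of retries to train the link" entry above gives two figures: 2 retries (so 3 attempts) and roughly 3 seconds per failed training sequence. The Python sketch below works through that arithmetic and shows the shape of a bounded retry loop. All names are hypothetical and `train_once` stands in for the real hardware procedure.

```python
from typing import Callable, Tuple

TRAINING_COST_S = 3  # approximate cost of one failed attempt, per the notes


def worst_case_seconds(retries: int, cost_s: int = TRAINING_COST_S) -> int:
    """Time spent if every attempt fails: (retries + 1) attempts in total."""
    return (retries + 1) * cost_s


def train_with_retries(train_once: Callable[[], bool],
                       retries: int = 2) -> Tuple[bool, int]:
    """Try to train the link, giving up after retries + 1 attempts.

    Returns (trained, attempts_used).
    """
    attempts = 0
    for _ in range(retries + 1):
        attempts += 1
        if train_once():
            return True, attempts
    return False, attempts


# A link that never trains uses all three attempts.
print(train_with_retries(lambda: False))
print(worst_case_seconds(2))
```

With 2 retries, a link that never trains costs about 9 seconds of boot time under these figures.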
+
+Testing and CI
+--------------
+
+Since :ref:`skiboot-6.1-rc1`:
+
+- test/qemu: start building qemu again, and use our built qemu for tests
+
+ We need to use QEMU_BIN rather than QEMU as the makefiles define
+ QEMU already.
+- opal-ci: qemu: Use the powernv-3.0 branch
+
+ This is based off the current development version of Qemu, and
+ importantly it contains the patch that allows skiboot and Linux to clear
+ the PCR that we require to boot.