skiboot 6.1 release notes

Signed-off-by: Stewart Smith <stewart@linux.ibm.com>
author: Stewart Smith <stewart@linux.ibm.com> 2018-07-11 18:26:16 +1000
committer: Stewart Smith <stewart@linux.ibm.com> 2018-07-11 18:27:41 +1000
commit: 452998f4be5973d8884e3db5aa362bf40e11467a (patch)
tree: 789bf277c3df2dde1bdebdde02ec1d17fc89dff7 /doc/release-notes
parent: 4949e4417d0eabf327f4c84cb87a619ce4af3730 (diff)
1 files changed, 651 insertions, 0 deletions
diff --git a/doc/release-notes/skiboot-6.1.rst b/doc/release-notes/skiboot-6.1.rst
new file mode 100644
index 00000000..45d87e4a
--- /dev/null
+++ b/doc/release-notes/skiboot-6.1.rst
@@ -0,0 +1,651 @@
+.. _skiboot-6.1:
+
+skiboot-6.1
+===========
+
+skiboot v6.1 was released on Wednesday July 11th 2018. It is the first
+release of skiboot 6.1, which is the new stable release of skiboot
+following the 6.0 release, first released May 11th 2018.
+
+Skiboot 6.1 is the basis for op-build v2.1 and contains all bug fixes as
+of :ref:`skiboot-6.0.5`, and :ref:`skiboot-5.4.9` (the currently maintained
+stable releases). We expect further stable releases in the 6.0.x and 5.4.x
+series, while we do not expect to do any stable releases of 6.1.x.
+
+This final 6.1 release follows a single release candidate release, as this
+cycle we have been rather quiet, with mainly cleanup and bug fix patches
+going in.
+
+For how the skiboot stable releases work, see :ref:`stable-rules`.
+
+Over skiboot-6.0, we have the following changes:
+
+General changes and bug fixes
+-----------------------------
+
+Since :ref:`skiboot-6.1-rc1`:
+
+- slw: Fix trivial typo in debug message
+- vpd: Add vendor property to processor node
+
+  Processor FRU vpd doesn't contain vendor detail. We have to parse
+  module VPD to get vendor detail.
+
+- vpd: Sanitize VPD data
+
+  On OpenPower systems, the VPD keyword size tells us the maximum size of the
+  data, but the trailing end is padded with spaces (0x20) instead of NULs. The
+  spec also doesn't stop users from having spaces (0x20) within the actual data.
+
+  This patch discards trailing spaces before populating device tree.
+- core: always flush console before stopping
+
+  This catches a few cases (e.g., fast reboot failure messages) that
+  don't always make it to the console before the machine is rebooted.
+- core/cpu: parallelise global CPU register setting jobs
+
+  On a 176 thread system, before: ::
+
+    [  122.319923233,5] OPAL: Switch to big-endian OS
+    [  126.317897467,5] OPAL: Switch to little-endian OS
+
+  after: ::
+
+    [  212.439299889,5] OPAL: Switch to big-endian OS
+    [  212.469323643,5] OPAL: Switch to little-endian OS
+- init, occ: Initialise OCC earlier on BMC systems
+
+  We need to use the OCC to obtain presence data for the SXM2 slots on
+  Witherspoon systems. This is needed to determine device type for NVLink
+  GPUs and OpenCAPI devices which can be plugged into the same slot. Support
+  for this will be implemented in a future patch.
+
+  Currently, OCC initialisation is done just before handing over to Linux,
+  which is well after NPU probe. On FSP systems, OCC boot starts very late,
+  so we wait until the last possible moment to initialise the skiboot side in
+  order to give it the maximum time to boot. On BMC systems, OCC boot starts
+  earlier, so there aren't any issues in moving it earlier in the skiboot
+  init sequence.
+
+  When running on a BMC machine, call occ_pstates_init() as early as
+  possible in the init sequence. On FSP machines, continue to call it from
+  its current location.
+
+Since :ref:`skiboot-6.0`:
+
+- GCC8 build fixes
+- Add prepare_hbrt_update to hbrt interfaces
+
+  Add placeholder support for prepare_hbrt_update call into
+  hostboot runtime (opal-prd) code.  This interface is only
+  called as part of a concurrent code update on a FSP based
+  system.
+- cpu: Clear PCR SPR in opal_reinit_cpus()
+
+  Currently if Linux boots with a non-zero PCR, things can go bad where
+  some early userspace programs can take illegal instructions. This is
+  being fixed in Linux, but in the meantime, we should clean up in
+  skiboot also.
+- pci: Fix PCI_DEVICE_ID()
+
+  The vendor ID is 16 bits not 8. This error leaves the top of the vendor
+  ID in the bottom bits of the device ID, which resulted in e.g. a failure
+  to run the PCI quirk for the AST VGA device.
+- Quieten console output on boot
+
+  We print out a whole bunch of things on boot, most of which aren't
+  interesting, so we should *not* print them instead.
+
+  Printing things like what CPUs we found and what PCI devices we found
+  *are* useful, so continue to do that. But we don't need to splat out
+  a bunch of things that are always going to be true.
+- core/console: fix deadlock when printing with console lock held
+
+  Some debugging options will print while the console lock is held,
+  which is why the console lock is taken as a recursive lock.
+  However console_write calls __flush_console, which will drop and
+  re-take the lock non-recursively in some cases.
+
+  Just set con_need_flush and return from __flush_console if we are
+  holding the console lock already.
+
+  This stack usage message (taken with this patch applied) could lead
+  to a deadlock without this: ::
+
+    CPU 0000 lowest stack mark 11768 bytes left pc=300cb808 token=0
+    CPU 0000 Backtrace:
+    S: 0000000031c03370 R: 00000000300cb808   .list_check_node+0x1c
+    S: 0000000031c03410 R: 00000000300cb910   .list_check+0x38
+    S: 0000000031c034b0 R: 00000000300190ac   .try_lock_caller+0xb8
+    S: 0000000031c03540 R: 00000000300192e0   .lock_caller+0x80
+    S: 0000000031c03600 R: 0000000030012c70   .__flush_console+0x134
+    S: 0000000031c036d0 R: 00000000300130cc   .console_write+0x68
+    S: 0000000031c03780 R: 00000000300347bc   .vprlog+0xc8
+    S: 0000000031c03970 R: 0000000030034844   ._prlog+0x50
+    S: 0000000031c03a00 R: 00000000300364a4   .log_simple_error+0x74
+    S: 0000000031c03b90 R: 000000003004ab48   .occ_pstates_init+0x184
+    S: 0000000031c03d50 R: 000000003001480c   .load_and_boot_kernel+0x38c
+    S: 0000000031c03e30 R: 000000003001571c   .main_cpu_entry+0x62c
+    S: 0000000031c03f00 R: 0000000030002700   boot_entry+0x1c0
+- opal-prd: Do not error out on first failure for soft/hard offline.
+
+  The memory errors (CEs and UEs) that are detected as part of background
+  memory scrubbing are reported by PRD asynchronously to opal-prd along with
+  affected memory ranges. hservice_memory_error() converts these ranges into
+  page granularity before hooking them up to the soft/hard offline-ing
+  infrastructure.
+
+  But the current implementation of hservice_memory_error() does not hook up
+  all the pages to soft/hard offline-ing if any of the page offline actions
+  fails, e.g. hard offline can fail for:
+
+  - Pages that are not part of buddy managed pool.
+  - Pages that are reserved by kernel using memblock_reserved()
+  - Pages that are in use by kernel.
+
+  But for pages that are in use by a user space application, the hard
+  offline marks the page as hwpoison, sends a SIGBUS signal to kill the
+  affected application as a recovery action and returns success.
+
+  Hence, it is possible that some of the pages in that memory range are in
+  use by an application or free. By stopping on the first error we lose the
+  opportunity to hwpoison the subsequent pages which may be free or in use by
+  an application. This patch fixes this issue.
+- libflash/blocklevel_write: Fix missing error handling
+
+  Caught by scan-build, we seem to trap the errors in rc, but
+  not take any recovery action during blocklevel_write.
+
+I2C
+^^^
+- p8-i2c: fix wrong request status when a reset is needed
+
+  If the bus is found in error state when starting a new request, the
+  engine is reset and we enter recovery. However, once complete, the
+  reset operation shows a status of complete in the status register. So
+  any badly-timed call to check_status() will think the current top
+  request is complete, even though it hasn't run yet.
+
+  So don't update any request status while we are in recovery, as
+  nothing useful for the request is supposed to happen in that state.
+- p8-i2c: Remove force reset
+
+  Force reset was added as an attempt to work around some issues with TPM
+  devices locking up their I2C bus. In that particular case the problem
+  was that the device would hold the SCL line down permanently due to a
+  device firmware bug. The force reset doesn't actually do anything to
+  alleviate the situation here, it just happens to reset the internal
+  master state enough to make the I2C driver appear to work until
+  something tries to access the bus again.
+
+  On P9 systems with secure boot enabled there is the added problem
+  of the "diagnostic mode" not being supported on I2C masters A, B, C and
+  D. Diagnostic mode allows the SCL and SDA lines to be driven directly
+  by software. Without this, force reset is impossible to implement.
+
+  This patch removes the force reset functionality entirely since:
+
+  a) it doesn't do what it's supposed to, and
+  b) it's butt ugly code
+
+  Additionally, turn p8_i2c_reset_engine() into p8_i2c_reset_port().
+  There's no need to reset every port on a master in response to an
+  error that occurred on a specific port.
+- libstb/i2c-driver: Bump max timeout
+
+  We have observed some TPMs clock stretching the I2C bus for significant
+  amounts of time when processing commands. The same TPMs also have
+  errata that can result in permanently locking up a bus in response to
+  an I2C transaction they don't understand. Use an excessively long
+  timeout to prevent this in the field.
+- hdata: Add TPM timeout workaround
+
+  Set the default timeout for any bus containing a TPM to one second. This
+  is needed to work around a bug in the firmware of certain TPMs that will
+  clock stretch the I2C port for up to a second. Additionally, when the
+  TPM is clock stretching it responds to a STOP condition on the bus by
+  bricking itself. Clearing this error requires a hard power cycle of the
+  system since the TPM is powered by standby power.
+- p8-i2c: Allow a per-port default timeout
+
+  Add support for setting a default timeout for the I2C port to the
+  device-tree. This is consumed by skiboot.
+
+IPMI Watchdog
+^^^^^^^^^^^^^
+- ipmi-watchdog: Support handling re-initialization
+
+  Watchdog resets can return an error code from the BMC indicating that
+  the BMC watchdog was not initialized. Currently we abort skiboot due to
+  a missing error handler. This patch implements handling
+  re-initialization for the watchdog, automatically saving the last
+  watchdog set values and re-issuing them if needed.
+- ipmi-watchdog: The stop action should disable reset
+
+  Otherwise it is possible for the reset timer to elapse and trigger the
+  watchdog to wake back up. This doesn't affect the behavior of the
+  system since we are providing a NONE action to the BMC. However we would
+  like to avoid the action from taking place if possible.
+- ipmi-watchdog: Add a flag to determine if we are still ticking
+
+  This makes it easier for future changes to ensure that the watchdog
+  stops ticking and doesn't requeue itself for execution in the
+  background. This way it is safe for resets to be performed after the
+  ticks are assumed to be stopped and it won't start the timer again.
+- ipmi-watchdog: (prepare for) not disabling at shutdown
+
+  The op-build linux kernel has been configured to support the ipmi
+  watchdog. This driver will always handle the watchdog by either leaving
+  it enabled if configured, or by disabling it during module load if no
+  configuration is provided. This increases the coverage of the watchdog
+  during the boot process. The watchdog should no longer be disabled at
+  any point during skiboot execution.
+
+  We're not enabling this by default yet as people can (and do, at least in
+  development) mix and match old BOOTKERNEL with new skiboot and we don't
+  want to break that too obviously.
+- ipmi-watchdog: Don't reset the watchdog twice
+
+  There is no clarification for why this change was needed, but presumably
+  this is due to a buggy BMC implementation where the Watchdog Set command
+  was processed concurrently or after the initial Watchdog Reset. This
+  inversion would cause the watchdog to stop since the DONT_STOP bit was
+  not set. Since we are now using the DONT_STOP bit during initialization,
+  the watchdog should not be stopped even if an inversion occurs.
+- ipmi-watchdog: Make it possible to set DONT_STOP
+
+  The IPMI standard supports setting a DONT_STOP bit during a Watchdog
+  Set operation. Most of the time we don't want to stop the Watchdog when
+  updating the settings so we should be using this bit. This patch makes
+  it possible for callers of set_wdt to prevent the watchdog from being
+  stopped. This only changes the behavior of the watchdog during the
+  initial settings update when initializing skiboot. The watchdog is no
+  longer disabled and then immediately re-enabled.
+- ipmi-watchdog: WD_POWER_CYCLE_ACTION -> WD_RESET_ACTION
+
+  The IPMI specification denotes that action 0x1 is Host Reset and 0x3 is
+  Host Power Cycle. Use the correct name for Reset in our watchdog code.
+
+
+POWER8 platforms
+----------------
+
+- astbmc: Enable mbox depending on scratch reg
+
+  P8 boxes can opt in for mbox pnor support if they set the scratch
+  register bit to indicate it is supported.
+
+Simulator platforms
+-------------------
+
+Since :ref:`skiboot-6.1-rc1`:
+
+- pmem: volatile bindings for the poorly enabled
+
+  PMEM_DISK bindings were added, but they rely on a rather
+  recent mmap feature. This patch steals from those bindings
+  to add volatile bindings. I've used these bindings with
+  PMEM_VOLATILE to launch an instance with the publicly
+  available systemsim-p9. The bindings are volatile and one
+  should not expect any data to be saved/retrieved.
+
+Since :ref:`skiboot-6.0`:
+
+- plat/qemu: add PNOR support
+
+  To access the PNOR, OPAL/skiboot drives the BMC SPI controller using
+  the iLPC2AHB device of the BMC SuperIO controller and accesses the
+  flash contents using the LPC FW address space on which the PNOR is
+  remapped.
+
+  The QEMU PowerNV machine now integrates such models (SuperIO
+  controller, iLPC2AHB device) and also a pseudo Aspeed SoC AHB memory
+  space populated with the SPI controller registers (same model as for
+  ARM). The AHB window giving access to the contents of the BMC SPI
+  controller flash modules is mapped on the LPC FW address space.
+
+  The change should be compatible with machines without PNOR support.
+- external/mambo: Add support for readline if it exists
+
+  Add support for tclreadline package if it is present.
+  This patch loads the package and uses it when the
+  simulation stops for any reason.
+
+
+FSP based platforms
+-------------------
+
+- Disable fast reboot on FSP IPL side change
+
+  If FSP changes next IPL side, then disable fast reboot.
+
+  sample output: ::
+
+      [  620.196442259,5] FSP: Got sysparam update, param ID 0xf0000007
+      [  620.196444501,5] CUPD: FW IPL side changed. Disable fast reboot
+      [  620.196445389,5] CUPD: Next IPL side : perm
+- fsp/console: Always establish OPAL console API backend
+
+  Currently we only call set_opal_console() to establish the backend
+  used by the OPAL console API if we find at least one FSP serial
+  port in HDAT.
+
+  On systems where there is none (IPMI only), we fail to set it,
+  causing the console code to try to use the dummy console causing
+  an assertion failure during boot due to clashing on the device-tree
+  node names.
+
+  So always set it if an FSP is present.
+
+AST BMC based platforms
+-----------------------
+
+- AMI BMC: use 0x3a as OEM command
+
+  The 0x3a OEM command is for IBM commands, while 0x32 was for AMI ones.
+  Sometime in the P8 timeframe, AMI BMCs were changed to listen for our
+  commands on either 0x32 or 0x3a. Since 0x3a is the direction forward,
+  we'll use that, as P9 machines with AMI BMCs probably also want these
+  to work, and let's not bet that 0x32 will continue to be okay.
+- astbmc: Set romulus BMC type to OpenBMC
+- platform/astbmc: Do not delete compatible property
+
+  From P9 onwards, OPAL builds the device tree for BMC based systems using
+  HDAT. We populate the bmc/compatible property with the BMC version. Hence
+  do not delete this property.
+
+Utilities
+---------
+- external/xscom-utils: Add python library for xscom access
+
+  Patch adds a simple python library module for xscom access.
+  It directly manipulates the '/access' file for scom read
+  and write from debugfs 'scom' directory.
+
+  Example on how to generate a getscom using this module:
+
+  .. code-block:: python
+
+     from adu_scoms import *
+     getscom = GetSCom()
+     getscom.parse_args()
+     getscom.run_command()
+
+  Sample output for above getscom.py:
+
+  .. code-block:: console
+
+    # ./getscom.py -l
+    Chip ID  | Rev   | Chip type
+    ---------|-------|-----------
+    00000008 | DD2.0 | P9 (Nimbus) processor
+    00000000 | DD2.0 | P9 (Nimbus) processor
+- ffspart: Don't require user to create blank partitions manually
+
+  Add '--allow-empty' which allows the filename for a given partition to
+  be blank. If set, ffspart will set that part of the PNOR file 'blank' and
+  set ECC bits if required.
+  Without this option behaviour is unchanged and ffspart will return an
+  error if it can not find the partition file.
+- pflash: Use correct prefix when installing
+
+  pflash uses lowercase prefix when running make install in its
+  directory, but uppercase PREFIX when running it in shared. Use
+  lowercase everywhere.
+
+  With this the OpenBMC bitbake recipe can drop an out of tree patch it's
+  been carrying for years.
+
+
+POWER9
+------
+
+Since :ref:`skiboot-6.1-rc1`:
+
+- occ: sensors: Fix the size of the phandle array 'sensors' in DT
+
+  Fixes: 99505c03f493 (present in v5.10-rc4)
+- phb4: Delay training till after PERST is deasserted
+
+  This helps some cards train on the second PERST (ie fast-reboot). The
+  reason why is not clear, but it helps, so YOLO!
+
+Since :ref:`skiboot-6.0`:
+
+- occ-sensor: Avoid using uninitialised struct cpu_thread
+
+  When adding the sensors in occ_sensors_init, if the type is not
+  OCC_SENSOR_LOC_CORE, then the loop to find 'c' will not be executed.
+  Then c->pir is used for both of the add_sensor_node calls below.
+
+  This provides a default value of 0 instead.
+- NX: Add NX coprocessor init opal call
+
+  The read offset (4:11) in the Receive FIFO control register is incremented
+  by FIFO size whenever a CRB is read by NX. But the index in RxFIFO has to
+  match the corresponding entry in the FIFO maintained by VAS in the kernel.
+  The VAS entry is reset to 0 when opening the receive window during driver
+  initialization. So when NX842 is reloaded or on kexec boot, there is a
+  possibility of a mismatch between the RxFIFO control register and the VAS
+  entries in the kernel. It could cause CRB failure / timeout from NX.
+
+  This patch adds nx_coproc_init opal call for kernel to initialize
+  readOffset (4:11) and Queued (15:23) in RxFIFO control register.
+- SLW: Remove stop1_lite and stop2_lite
+
+  stop1_lite has been removed since it adds no additional benefit
+  over stop0_lite. stop2_lite has been removed since currently it adds
+  minimal benefit over stop2. However, the benefit is eclipsed by the time
+  required to ungate the clocks.
+
+  Moreover, Lite states don't give up the SMT resources, and can potentially
+  have a performance impact on sibling threads.
+
+  Since current OSs (Linux) aren't smart enough to make good decisions
+  with these stop states, we're (temporarily) removing them from what
+  we expose to the OS, the idea being to bring them back in a new
+  DT representation so that only an OS that knows what to do will
+  do things with them.
+- cpu: Use STOP1 on POWER9 for idle/sleep inside OPAL
+
+  The current code requests STOP3, which means it gets STOP2 in practice.
+
+  STOP2 has proven to occasionally be unreliable depending on FW
+  version and chip revision, it also requires a functional CME,
+  so instead, let's use STOP1. The difference is rather minimal
+  for something that is only used a few seconds during boot.
+
+NPU2 (NVLink2 and OpenCAPI)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Since :ref:`skiboot-6.1-rc1`:
+
+- capi: Select the correct IODA table entry for the mbt cache.
+
+  With the current code, the capi mmio window is not correctly configured
+  in the IODA table entry. The first entry (generally the non-prefetchable
+  BAR) is overwritten.
+  This patch sets the capi window bar at the right place.
+- npu2/hw-procedures: Fence bricks via NTL instead of MISC
+
+  There are a couple of places we can set/unset fence for a brick:
+
+  1. MISC register: NPU2_MISC_FENCE_STATE
+  2. NTL register for the brick: NPU2_NTL_MISC_CFG1(ndev)
+
+  Recent testing of ATS in combination with GPU reset has exposed a side
+  effect of using (1); if fence is set for all six bricks, it triggers a
+  sticky nmmu latch which prevents the NPU from getting ATR responses.
+  This manifests as a hang in the tests.
+
+  We have npu2_dev_fence_brick() which uses (1), and only two calls to it.
+  Replace the call which sets fence with a write to (2). Remove the
+  corresponding unset call entirely. It's unneeded because the procedures
+  already do a progression from full fence to half to idle using (2).
+
+- phb4/capp: Calculate STQ/DMA read engines based on link-width for PEC
+
+  Presently in CAPI mode the number of STQ/DMA-read engines allocated on
+  PEC2 for CAPP is fixed to 6 and 0-30 respectively irrespective of the
+  PCI link width. These values are only suitable for x8 cards and
+  quickly run out if a x16 card is plugged to a PEC2 attached slot. This
+  usually manifests as CAPP reporting TLBI timeout due to these messages
+  getting stalled due to insufficient STQs.
+
+  To fix this we update enable_capi_mode() to check if PEC2 chiplet is
+  in x16 mode and if yes then we allocate 4/0-47 STQ/DMA-read engines
+  for the CAPP traffic.
+
+  Fixes: 37ea3cfdc852 (present in v5.7-rc1)
+- npu2: Use same compatible string for NVLink and OpenCAPI link nodes in device tree
+
+  Currently, we distinguish between NPU links for NVLink devices and OpenCAPI
+  devices through the use of two different compatible strings - ibm,npu-link
+  and ibm,npu-link-opencapi.
+
+  As we move towards supporting configurations with both NVLink and OpenCAPI
+  devices behind a single NPU, we need to detect the device type as part of
+  presence detection, which can't happen until well after the point where the
+  HDAT or platform code has created the NPU device tree nodes. Changing a
+  node's compatible string after it's been created is a bit ugly, so instead
+  we should move the device type to a new property which we can add to the
+  node later on.
+
+  Get rid of the ibm,npu-link-opencapi compatible string, add a new
+  ibm,npu-link-type property, and a helper function to check the link type.
+  Add an "unknown" device type in preparation for later patches to detect
+  device type dynamically.
+
+  These device tree bindings are entirely internal to skiboot and are not
+  consumed directly by Linux, so this shouldn't break anything (other than
+  internal BML lab environments).
+- occ: Add support for GPU presence detection
+
+  On the Witherspoon platform, we need to distinguish between NVLink GPUs and
+  OpenCAPI accelerators. In order to do this, we first need to find out
+  whether the SXM2 socket is populated.
+
+  On Witherspoon, the SXM2 socket's presence detection pin is only visible
+  via I2C from the APSS, and thus can only be exposed to the host via the
+  OCC. The OCC, per OCC Firmware Interface Specification for POWER9 version
+  0.22, now exposes this to skiboot through a field in the dynamic data
+  shared memory.
+
+  Add the necessary dynamic data changes required to read the version and
+  GPU presence fields. Add a function, occ_get_gpu_presence(), that can be
+  used to check GPU presence.
+
+  If the OCC isn't reporting presence (old OCC firmware, or some other
+  reason), we default to assuming there is a device present and wait for
+  link training to fail.
+
+  This will be used in later patches to fix up the NPU2 probe path for
+  OpenCAPI support on Witherspoon.
+- hw/npu2, core/hmi: Use NPU instead of NPU2 as log message prefix
+
+  The NPU2{DBG,INF,ERR} macros use "NPU%d" as a prefix to identify messages
+  relating to a particular NPU.
+
+  It's slightly confusing to have per-NPU messages prefixed with "NPU0" or
+  "NPU1" and NPU-generic messages prefixed with "NPU2". On some future system
+  we could potentially have an NPU #2 in which case it'd be really confusing.
+
+  Use NPU rather than NPU2 for NPU-generic log messages. There's no risk of
+  confusion with the original npu.c code since that's only for P8.
+
+Since :ref:`skiboot-6.0`:
+
+- npu2: Reset NVLinks on hot reset
+
+  This effectively fences GPU RAM on GPU reset so the host system
+  does not have to crash every time we stop a KVM guest with a GPU
+  passed through.
+- npu2-opencapi: reduce number of retries to train the link
+
+  We've been reliably training the opencapi link on the first attempt
+  for quite a while. Furthermore, if it doesn't train on the first
+  attempt, retries haven't been that useful. So let's reduce the number
+  of attempts we do to train the link.
+
+  2 retries = 3 attempts to train.
+
+  Each (failed) training sequence costs about 3 seconds.
+- opal/hmi: Display correct chip id while printing NPU FIRs.
+
+  HMIs for NPU xstops are broadcast to all chips. All cores on all the
+  chips receive HMI. HMI handler correctly identifies and extracts the
+  NPU FIR details from affected chip, but while printing FIR data it
+  prints chip id and location code details of this_cpu()->chip_id which
+  may not be correct. This patch fixes this issue.
+- npu2-opencapi: Fix link state to report link down
+
+  The PHB callback 'get_link_state' is always reporting the link width,
+  irrespective of the link status and even when the link is down. It is
+  causing too much work (and failures) when the PHB is probed during pci
+  init.
+  The fix is to look at the link status first and report the link as
+  down when appropriate.
+- npu2-opencapi: Cleanup traces printed during link training
+
+  Now that links may train in parallel, traces shown during training can
+  be all mixed up. So add a prefix to all the traces to clearly identify
+  the chip and link the trace refers to: ::
+
+    OCAPI[<chip id>:<link id>]: this is a very useful message
+
+  The lower-level hardware procedures (npu2-hw-procedures.c) also print
+  traces which would need work. But that code is being reworked to be
+  better integrated with opencapi and nvidia, so leave it alone for now.
+- npu2-opencapi: Train links on fundamental reset
+
+  Reorder our link training steps so that they are executed on
+  fundamental reset instead of during the initial setup. Skiboot always
+  calls a fundamental reset on all the PHBs during pci init.
+
+  It is done through a state machine, similarly to what is done for
+  'real' PHBs.
+
+  This is the first step for a longer term goal to be able to trigger an
+  adapter reset from linux. We'll need the reset callbacks of the PHB to
+  be defined. We have to handle the various delays differently, since a
+  linux thread shouldn't stay stuck waiting in opal for too long.
+- npu2-opencapi: Rework adapter reset
+
+  Rework the code to reset the opencapi adapter a bit:
+
+  - make clearer which i2c pin is resetting which device
+  - break the reset operation into smaller chunks. This is really to
+    prepare for a future patch.
+
+  No functional changes.
+- npu2-opencapi: Use presence detection
+
+  Presence detection is not part of the opencapi specification. So each
+  platform may choose to implement it the way it wants.
+
+  All current platforms implement it through an i2c device where we can
+  query a pin to know if a device is connected or not. ZZ and Zaius have
+  a similar design and even use the same i2c information and pin
+  numbers.
+  However, presence detection on older ZZ planar (older than v4) doesn't
+  work, so we don't activate it for now, until our lab systems are
+  upgraded and it's better tested.
+
+  Presence detection on witherspoon is still being worked on. It's
+  shaping up to be quite different, so we may have to revisit the topic
+  in a later patch.
+
+Testing and CI
+--------------
+
+Since :ref:`skiboot-6.1-rc1`:
+
+- test/qemu: start building qemu again, and use our built qemu for tests
+
+  We need to use QEMU_BIN rather than QEMU as the makefiles define
+  QEMU already.
+- opal-ci: qemu: Use the powernv-3.0 branch
+
+  This is based off the current development version of Qemu, and
+  importantly it contains the patch that allows skiboot and Linux to clear
+  the PCR that we require to boot.
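
The "pci: Fix PCI_DEVICE_ID()" entry in the notes above describes a shift-width bug. The following is a minimal Python sketch of that arithmetic, not skiboot's actual C macro. It assumes the combined 32-bit value keeps the vendor ID in the low 16 bits and the device ID in the high 16 bits, as in PCI config space. The function names are hypothetical, and the ASPEED VGA IDs (1a03:2000) are used only as an illustration.

```python
# Hypothetical sketch of the PCI_DEVICE_ID() fix; the names below are
# illustrative assumptions and are not taken from the skiboot source.

AST_VENDOR_ID = 0x1A03      # ASPEED vendor ID (illustration)
AST_VGA_DEVICE_ID = 0x2000  # ASPEED VGA device ID (illustration)


def pci_vendor_id(vdid):
    """Vendor ID: the low 16 bits of the combined vendor/device word."""
    return vdid & 0xFFFF


def pci_device_id_buggy(vdid):
    """Treats the vendor ID as 8 bits wide, so the shift is too short."""
    return (vdid >> 8) & 0xFFFF


def pci_device_id_fixed(vdid):
    """Treats the vendor ID as 16 bits wide and shifts past all of it."""
    return (vdid >> 16) & 0xFFFF


# Combined word as read from config space offset 0 (assumed layout).
vdid = (AST_VGA_DEVICE_ID << 16) | AST_VENDOR_ID

print(hex(pci_vendor_id(vdid)))
print(hex(pci_device_id_buggy(vdid)))
print(hex(pci_device_id_fixed(vdid)))
```

With the too-short shift, the top byte of the vendor ID ends up in the bottom bits of the computed device ID, so a quirk table keyed on the real device ID would never match. That is consistent with the failure described for the AST VGA device.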