diff options
Diffstat (limited to 'Documentation/driver-api')
39 files changed, 2678 insertions, 315 deletions
diff --git a/Documentation/driver-api/dell_rbu.rst b/Documentation/driver-api/dell_rbu.rst deleted file mode 100644 index 5d1ce7bcd04d..000000000000 --- a/Documentation/driver-api/dell_rbu.rst +++ /dev/null @@ -1,128 +0,0 @@ -============================================================= -Usage of the new open sourced rbu (Remote BIOS Update) driver -============================================================= - -Purpose -======= - -Document demonstrating the use of the Dell Remote BIOS Update driver. -for updating BIOS images on Dell servers and desktops. - -Scope -===== - -This document discusses the functionality of the rbu driver only. -It does not cover the support needed from applications to enable the BIOS to -update itself with the image downloaded in to the memory. - -Overview -======== - -This driver works with Dell OpenManage or Dell Update Packages for updating -the BIOS on Dell servers (starting from servers sold since 1999), desktops -and notebooks (starting from those sold in 2005). - -Please go to http://support.dell.com register and you can find info on -OpenManage and Dell Update packages (DUP). - -Libsmbios can also be used to update BIOS on Dell systems go to -http://linux.dell.com/libsmbios/ for details. - -Dell_RBU driver supports BIOS update using the monolithic image and packetized -image methods. In case of monolithic the driver allocates a contiguous chunk -of physical pages having the BIOS image. In case of packetized the app -using the driver breaks the image in to packets of fixed sizes and the driver -would place each packet in contiguous physical memory. The driver also -maintains a link list of packets for reading them back. - -If the dell_rbu driver is unloaded all the allocated memory is freed. - -The rbu driver needs to have an application (as mentioned above)which will -inform the BIOS to enable the update in the next system reboot. - -The user should not unload the rbu driver after downloading the BIOS image -or updating. - -The driver load creates the following directories under the /sys file system:: - - /sys/class/firmware/dell_rbu/loading - /sys/class/firmware/dell_rbu/data - /sys/devices/platform/dell_rbu/image_type - /sys/devices/platform/dell_rbu/data - /sys/devices/platform/dell_rbu/packet_size - -The driver supports two types of update mechanism; monolithic and packetized. -These update mechanism depends upon the BIOS currently running on the system. -Most of the Dell systems support a monolithic update where the BIOS image is -copied to a single contiguous block of physical memory. - -In case of packet mechanism the single memory can be broken in smaller chunks -of contiguous memory and the BIOS image is scattered in these packets. - -By default the driver uses monolithic memory for the update type. This can be -changed to packets during the driver load time by specifying the load -parameter image_type=packet. This can also be changed later as below:: - - echo packet > /sys/devices/platform/dell_rbu/image_type - -In packet update mode the packet size has to be given before any packets can -be downloaded. It is done as below:: - - echo XXXX > /sys/devices/platform/dell_rbu/packet_size - -In the packet update mechanism, the user needs to create a new file having -packets of data arranged back to back. It can be done as follows -The user creates packets header, gets the chunk of the BIOS image and -places it next to the packetheader; now, the packetheader + BIOS image chunk -added together should match the specified packet_size. This makes one -packet, the user needs to create more such packets out of the entire BIOS -image file and then arrange all these packets back to back in to one single -file. - -This file is then copied to /sys/class/firmware/dell_rbu/data. -Once this file gets to the driver, the driver extracts packet_size data from -the file and spreads it across the physical memory in contiguous packet_sized -space. - -This method makes sure that all the packets get to the driver in a single operation. - -In monolithic update the user simply get the BIOS image (.hdr file) and copies -to the data file as is without any change to the BIOS image itself. - -Do the steps below to download the BIOS image. - -1) echo 1 > /sys/class/firmware/dell_rbu/loading -2) cp bios_image.hdr /sys/class/firmware/dell_rbu/data -3) echo 0 > /sys/class/firmware/dell_rbu/loading - -The /sys/class/firmware/dell_rbu/ entries will remain till the following is -done. - -:: - - echo -1 > /sys/class/firmware/dell_rbu/loading - -Until this step is completed the driver cannot be unloaded. - -Also echoing either mono, packet or init in to image_type will free up the -memory allocated by the driver. - -If a user by accident executes steps 1 and 3 above without executing step 2; -it will make the /sys/class/firmware/dell_rbu/ entries disappear. - -The entries can be recreated by doing the following:: - - echo init > /sys/devices/platform/dell_rbu/image_type - -.. note:: echoing init in image_type does not change it original value. - -Also the driver provides /sys/devices/platform/dell_rbu/data readonly file to -read back the image downloaded. - -.. note:: - - After updating the BIOS image a user mode application needs to execute - code which sends the BIOS update request to the BIOS. So on the next reboot - the BIOS knows about the new image downloaded and it updates itself. - Also don't unload the rbu driver if the image has to be updated. - diff --git a/Documentation/driver-api/devfreq.rst b/Documentation/driver-api/devfreq.rst new file mode 100644 index 000000000000..4a0bf87a3b13 --- /dev/null +++ b/Documentation/driver-api/devfreq.rst @@ -0,0 +1,30 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================== +Device Frequency Scaling +======================== + +Introduction +------------ + +This framework provides a standard kernel interface for Dynamic Voltage and +Frequency Switching on arbitrary devices. + +It exposes controls for adjusting frequency through sysfs files which are +similar to the cpufreq subsystem. + +Devices for which current usage can be measured can have their frequency +automatically adjusted by governors. + +API +--- + +Device drivers need to initialize a :c:type:`devfreq_profile` and call the +:c:func:`devfreq_add_device` function to create a :c:type:`devfreq` instance. + +.. kernel-doc:: include/linux/devfreq.h +.. kernel-doc:: include/linux/devfreq-event.h +.. kernel-doc:: drivers/devfreq/devfreq.c + :export: +.. kernel-doc:: drivers/devfreq/devfreq-event.c + :export: diff --git a/Documentation/driver-api/device_link.rst b/Documentation/driver-api/device_link.rst index ae1e3d0394b0..bc2d89af88ce 100644 --- a/Documentation/driver-api/device_link.rst +++ b/Documentation/driver-api/device_link.rst @@ -78,8 +78,8 @@ typically deleted in its ``->remove`` callback for symmetry. That way, if the driver is compiled as a module, the device link is added on module load and orderly deleted on unload. The same restrictions that apply to device link addition (e.g. exclusion of a parallel suspend/resume transition) apply equally -to deletion. Device links with ``DL_FLAG_STATELESS`` unset (i.e. managed -device links) are deleted automatically by the driver core. +to deletion. Device links managed by the driver core are deleted automatically +by it. Several flags may be specified on device link addition, two of which have already been mentioned above: ``DL_FLAG_STATELESS`` to express that no @@ -281,7 +281,8 @@ State machine :c:func:`driver_bound()`.) * Before a consumer device is probed, presence of supplier drivers is - verified by checking that links to suppliers are in ``DL_STATE_AVAILABLE`` + verified by checking the consumer device is not in the wait_for_suppliers + list and by checking that links to suppliers are in ``DL_STATE_AVAILABLE`` state. The state of the links is updated to ``DL_STATE_CONSUMER_PROBE``. (Call to :c:func:`device_links_check_suppliers()` from :c:func:`really_probe()`.) diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst index b541e97c7ab1..c78db28519f7 100644 --- a/Documentation/driver-api/dma-buf.rst +++ b/Documentation/driver-api/dma-buf.rst @@ -118,13 +118,13 @@ Kernel Functions and Structures Reference Reservation Objects ------------------- -.. kernel-doc:: drivers/dma-buf/reservation.c +.. kernel-doc:: drivers/dma-buf/dma-resv.c :doc: Reservation Object Overview -.. kernel-doc:: drivers/dma-buf/reservation.c +.. kernel-doc:: drivers/dma-buf/dma-resv.c :export: -.. kernel-doc:: include/linux/reservation.h +.. kernel-doc:: include/linux/dma-resv.h :internal: DMA Fences diff --git a/Documentation/driver-api/dmaengine/client.rst b/Documentation/driver-api/dmaengine/client.rst index 45953f171500..e5953e7e4bf4 100644 --- a/Documentation/driver-api/dmaengine/client.rst +++ b/Documentation/driver-api/dmaengine/client.rst @@ -151,6 +151,93 @@ The details of these operations are: Note that callbacks will always be invoked from the DMA engines tasklet, never from interrupt context. +Optional: per descriptor metadata +--------------------------------- + DMAengine provides two ways for metadata support. + + DESC_METADATA_CLIENT + + The metadata buffer is allocated/provided by the client driver and it is + attached to the descriptor. + + .. code-block:: c + + int dmaengine_desc_attach_metadata(struct dma_async_tx_descriptor *desc, + void *data, size_t len); + + DESC_METADATA_ENGINE + + The metadata buffer is allocated/managed by the DMA driver. The client + driver can ask for the pointer, maximum size and the currently used size of + the metadata and can directly update or read it. + + Becasue the DMA driver manages the memory area containing the metadata, + clients must make sure that they do not try to access or get the pointer + after their transfer completion callback has run for the descriptor. + If no completion callback has been defined for the transfer, then the + metadata must not be accessed after issue_pending. + In other words: if the aim is to read back metadata after the transfer is + completed, then the client must use completion callback. + + .. code-block:: c + + void *dmaengine_desc_get_metadata_ptr(struct dma_async_tx_descriptor *desc, + size_t *payload_len, size_t *max_len); + + int dmaengine_desc_set_metadata_len(struct dma_async_tx_descriptor *desc, + size_t payload_len); + + Client drivers can query if a given mode is supported with: + + .. code-block:: c + + bool dmaengine_is_metadata_mode_supported(struct dma_chan *chan, + enum dma_desc_metadata_mode mode); + + Depending on the used mode client drivers must follow different flow. + + DESC_METADATA_CLIENT + + - DMA_MEM_TO_DEV / DEV_MEM_TO_MEM: + 1. prepare the descriptor (dmaengine_prep_*) + construct the metadata in the client's buffer + 2. use dmaengine_desc_attach_metadata() to attach the buffer to the + descriptor + 3. submit the transfer + - DMA_DEV_TO_MEM: + 1. prepare the descriptor (dmaengine_prep_*) + 2. use dmaengine_desc_attach_metadata() to attach the buffer to the + descriptor + 3. submit the transfer + 4. when the transfer is completed, the metadata should be available in the + attached buffer + + DESC_METADATA_ENGINE + + - DMA_MEM_TO_DEV / DEV_MEM_TO_MEM: + 1. prepare the descriptor (dmaengine_prep_*) + 2. use dmaengine_desc_get_metadata_ptr() to get the pointer to the + engine's metadata area + 3. update the metadata at the pointer + 4. use dmaengine_desc_set_metadata_len() to tell the DMA engine the + amount of data the client has placed into the metadata buffer + 5. submit the transfer + - DMA_DEV_TO_MEM: + 1. prepare the descriptor (dmaengine_prep_*) + 2. submit the transfer + 3. on transfer completion, use dmaengine_desc_get_metadata_ptr() to get + the pointer to the engine's metadata area + 4. read out the metadata from the pointer + + .. note:: + + When DESC_METADATA_ENGINE mode is used the metadata area for the descriptor + is no longer valid after the transfer has been completed (valid up to the + point when the completion callback returns if used). + + Mixed use of DESC_METADATA_CLIENT / DESC_METADATA_ENGINE is not allowed, + client drivers must use either of the modes per descriptor. + 4. Submit the transaction Once the descriptor has been prepared and the callback information diff --git a/Documentation/driver-api/dmaengine/index.rst b/Documentation/driver-api/dmaengine/index.rst index 3026fa975937..b9df904d0a79 100644 --- a/Documentation/driver-api/dmaengine/index.rst +++ b/Documentation/driver-api/dmaengine/index.rst @@ -47,7 +47,7 @@ This book adds some notes about PXA DMA pxa_dma -.. only:: subproject +.. only:: subproject and html Indices ======= diff --git a/Documentation/driver-api/dmaengine/provider.rst b/Documentation/driver-api/dmaengine/provider.rst index dfc4486b5743..790a15089f1f 100644 --- a/Documentation/driver-api/dmaengine/provider.rst +++ b/Documentation/driver-api/dmaengine/provider.rst @@ -247,6 +247,54 @@ after each transfer. In case of a ring buffer, they may loop (DMA_CYCLIC). Addresses pointing to a device's register (e.g. a FIFO) are typically fixed. +Per descriptor metadata support +------------------------------- +Some data movement architecture (DMA controller and peripherals) uses metadata +associated with a transaction. The DMA controller role is to transfer the +payload and the metadata alongside. +The metadata itself is not used by the DMA engine itself, but it contains +parameters, keys, vectors, etc for peripheral or from the peripheral. + +The DMAengine framework provides a generic ways to facilitate the metadata for +descriptors. Depending on the architecture the DMA driver can implement either +or both of the methods and it is up to the client driver to choose which one +to use. + +- DESC_METADATA_CLIENT + + The metadata buffer is allocated/provided by the client driver and it is + attached (via the dmaengine_desc_attach_metadata() helper to the descriptor. + + From the DMA driver the following is expected for this mode: + - DMA_MEM_TO_DEV / DEV_MEM_TO_MEM + The data from the provided metadata buffer should be prepared for the DMA + controller to be sent alongside of the payload data. Either by copying to a + hardware descriptor, or highly coupled packet. + - DMA_DEV_TO_MEM + On transfer completion the DMA driver must copy the metadata to the client + provided metadata buffer before notifying the client about the completion. + After the transfer completion, DMA drivers must not touch the metadata + buffer provided by the client. + +- DESC_METADATA_ENGINE + + The metadata buffer is allocated/managed by the DMA driver. The client driver + can ask for the pointer, maximum size and the currently used size of the + metadata and can directly update or read it. dmaengine_desc_get_metadata_ptr() + and dmaengine_desc_set_metadata_len() is provided as helper functions. + + From the DMA driver the following is expected for this mode: + - get_metadata_ptr + Should return a pointer for the metadata buffer, the maximum size of the + metadata buffer and the currently used / valid (if any) bytes in the buffer. + - set_metadata_len + It is called by the clients after it have placed the metadata to the buffer + to let the DMA driver know the number of valid bytes provided. + + Note: since the client will ask for the metadata pointer in the completion + callback (in DMA_DEV_TO_MEM case) the DMA driver must ensure that the + descriptor is not freed up prior the callback is called. + Device operations ----------------- diff --git a/Documentation/driver-api/driver-model/devres.rst b/Documentation/driver-api/driver-model/devres.rst index a100bef54952..46c13780994c 100644 --- a/Documentation/driver-api/driver-model/devres.rst +++ b/Documentation/driver-api/driver-model/devres.rst @@ -267,6 +267,8 @@ DRM GPIO devm_gpiod_get() + devm_gpiod_get_array() + devm_gpiod_get_array_optional() devm_gpiod_get_index() devm_gpiod_get_index_optional() devm_gpiod_get_optional() @@ -313,9 +315,13 @@ IOMAP devm_ioport_map() devm_ioport_unmap() devm_ioremap() - devm_ioremap_nocache() + devm_ioremap_uc() devm_ioremap_wc() devm_ioremap_resource() : checks resource, requests memory region, ioremaps + devm_ioremap_resource_wc() + devm_platform_ioremap_resource() : calls devm_ioremap_resource() for platform device + devm_platform_ioremap_resource_wc() + devm_platform_ioremap_resource_byname() devm_iounmap() pcim_iomap() pcim_iomap_regions() : do request_region() and iomap() on multiple BARs diff --git a/Documentation/driver-api/driver-model/driver.rst b/Documentation/driver-api/driver-model/driver.rst index 11d281506a04..baa6a85c8287 100644 --- a/Documentation/driver-api/driver-model/driver.rst +++ b/Documentation/driver-api/driver-model/driver.rst @@ -169,6 +169,49 @@ A driver's probe() may return a negative errno value to indicate that the driver did not bind to this device, in which case it should have released all resources it allocated:: + void (*sync_state)(struct device *dev); + +sync_state is called only once for a device. It's called when all the consumer +devices of the device have successfully probed. The list of consumers of the +device is obtained by looking at the device links connecting that device to its +consumer devices. + +The first attempt to call sync_state() is made during late_initcall_sync() to +give firmware and drivers time to link devices to each other. During the first +attempt at calling sync_state(), if all the consumers of the device at that +point in time have already probed successfully, sync_state() is called right +away. If there are no consumers of the device during the first attempt, that +too is considered as "all consumers of the device have probed" and sync_state() +is called right away. + +If during the first attempt at calling sync_state() for a device, there are +still consumers that haven't probed successfully, the sync_state() call is +postponed and reattempted in the future only when one or more consumers of the +device probe successfully. If during the reattempt, the driver core finds that +there are one or more consumers of the device that haven't probed yet, then +sync_state() call is postponed again. + +A typical use case for sync_state() is to have the kernel cleanly take over +management of devices from the bootloader. For example, if a device is left on +and at a particular hardware configuration by the bootloader, the device's +driver might need to keep the device in the boot configuration until all the +consumers of the device have probed. Once all the consumers of the device have +probed, the device's driver can synchronize the hardware state of the device to +match the aggregated software state requested by all the consumers. Hence the +name sync_state(). + +While obvious examples of resources that can benefit from sync_state() include +resources such as regulator, sync_state() can also be useful for complex +resources like IOMMUs. For example, IOMMUs with multiple consumers (devices +whose addresses are remapped by the IOMMU) might need to keep their mappings +fixed at (or additive to) the boot configuration until all its consumers have +probed. + +While the typical use case for sync_state() is to have the kernel cleanly take +over management of devices from the bootloader, the usage of sync_state() is +not restricted to that. Use it whenever it makes sense to take an action after +all the consumers of a device have probed. + int (*remove) (struct device *dev); remove is called to unbind a driver from a device. This may be diff --git a/Documentation/driver-api/generic-counter.rst b/Documentation/driver-api/generic-counter.rst index 8382f01a53e3..e622f8f6e56a 100644 --- a/Documentation/driver-api/generic-counter.rst +++ b/Documentation/driver-api/generic-counter.rst @@ -7,7 +7,7 @@ Generic Counter Interface Introduction ============ -Counter devices are prevalent within a diverse spectrum of industries. +Counter devices are prevalent among a diverse spectrum of industries. The ubiquitous presence of these devices necessitates a common interface and standard of interaction and exposure. This driver API attempts to resolve the issue of duplicate code found among existing counter device @@ -26,23 +26,72 @@ the Generic Counter interface. There are three core components to a counter: -* Count: - Count data for a set of Signals. - * Signal: - Input data that is evaluated by the counter to determine the count - data. + Stream of data to be evaluated by the counter. * Synapse: - The association of a Signal with a respective Count. + Association of a Signal, and evaluation trigger, with a Count. + +* Count: + Accumulation of the effects of connected Synapses. + +SIGNAL +------ +A Signal represents a stream of data. This is the input data that is +evaluated by the counter to determine the count data; e.g. a quadrature +signal output line of a rotary encoder. Not all counter devices provide +user access to the Signal data, so exposure is optional for drivers. + +When the Signal data is available for user access, the Generic Counter +interface provides the following available signal values: + +* SIGNAL_LOW: + Signal line is in a low state. + +* SIGNAL_HIGH: + Signal line is in a high state. + +A Signal may be associated with one or more Counts. + +SYNAPSE +------- +A Synapse represents the association of a Signal with a Count. Signal +data affects respective Count data, and the Synapse represents this +relationship. + +The Synapse action mode specifies the Signal data condition that +triggers the respective Count's count function evaluation to update the +count data. The Generic Counter interface provides the following +available action modes: + +* None: + Signal does not trigger the count function. In Pulse-Direction count + function mode, this Signal is evaluated as Direction. + +* Rising Edge: + Low state transitions to high state. + +* Falling Edge: + High state transitions to low state. + +* Both Edges: + Any state transition. + +A counter is defined as a set of input signals associated with count +data that are generated by the evaluation of the state of the associated +input signals as defined by the respective count functions. Within the +context of the Generic Counter interface, a counter consists of Counts +each associated with a set of Signals, whose respective Synapse +instances represent the count function update conditions for the +associated Counts. + +A Synapse associates one Signal with one Count. COUNT ----- -A Count represents the count data for a set of Signals. The Generic -Counter interface provides the following available count data types: - -* COUNT_POSITION: - Unsigned integer value representing position. +A Count represents the accumulation of the effects of connected +Synapses; i.e. the count data for a set of Signals. The Generic +Counter interface represents the count data as a natural number. A Count has a count function mode which represents the update behavior for the count data. The Generic Counter interface provides the following @@ -86,60 +135,7 @@ available count function modes: Any state transition on either quadrature pair signals updates the respective count. Quadrature encoding determines the direction. -A Count has a set of one or more associated Signals. - -SIGNAL ------- -A Signal represents a counter input data; this is the input data that is -evaluated by the counter to determine the count data; e.g. a quadrature -signal output line of a rotary encoder. Not all counter devices provide -user access to the Signal data. - -The Generic Counter interface provides the following available signal -data types for when the Signal data is available for user access: - -* SIGNAL_LEVEL: - Signal line state level. The following states are possible: - - - SIGNAL_LEVEL_LOW: - Signal line is in a low state. - - - SIGNAL_LEVEL_HIGH: - Signal line is in a high state. - -A Signal may be associated with one or more Counts. - -SYNAPSE -------- -A Synapse represents the association of a Signal with a respective -Count. Signal data affects respective Count data, and the Synapse -represents this relationship. - -The Synapse action mode specifies the Signal data condition which -triggers the respective Count's count function evaluation to update the -count data. The Generic Counter interface provides the following -available action modes: - -* None: - Signal does not trigger the count function. In Pulse-Direction count - function mode, this Signal is evaluated as Direction. - -* Rising Edge: - Low state transitions to high state. - -* Falling Edge: - High state transitions to low state. - -* Both Edges: - Any state transition. - -A counter is defined as a set of input signals associated with count -data that are generated by the evaluation of the state of the associated -input signals as defined by the respective count functions. Within the -context of the Generic Counter interface, a counter consists of Counts -each associated with a set of Signals, whose respective Synapse -instances represent the count function update conditions for the -associated Counts. +A Count has a set of one or more associated Synapses. Paradigm ======== @@ -286,10 +282,36 @@ if device memory-managed registration is desired. Extension sysfs attributes can be created for auxiliary functionality and data by passing in defined counter_device_ext, counter_count_ext, and counter_signal_ext structures. In these cases, the -counter_device_ext structure is used for global configuration of the -respective Counter device, while the counter_count_ext and -counter_signal_ext structures allow for auxiliary exposure and -configuration of a specific Count or Signal respectively. +counter_device_ext structure is used for global/miscellaneous exposure +and configuration of the respective Counter device, while the +counter_count_ext and counter_signal_ext structures allow for auxiliary +exposure and configuration of a specific Count or Signal respectively. + +Determining the type of extension to create is a matter of scope. + +* Signal extensions are attributes that expose information/control + specific to a Signal. These types of attributes will exist under a + Signal's directory in sysfs. + + For example, if you have an invert feature for a Signal, you can have + a Signal extension called "invert" that toggles that feature: + /sys/bus/counter/devices/counterX/signalY/invert + +* Count extensions are attributes that expose information/control + specific to a Count. These type of attributes will exist under a + Count's directory in sysfs. + + For example, if you want to pause/unpause a Count from updating, you + can have a Count extension called "enable" that toggles such: + /sys/bus/counter/devices/counterX/countY/enable + +* Device extensions are attributes that expose information/control + non-specific to a particular Count or Signal. This is where you would + put your global features or other miscellanous functionality. + + For example, if your device has an overtemp sensor, you can report the + chip overheated via a device extension called "error_overtemp": + /sys/bus/counter/devices/counterX/error_overtemp Architecture ============ diff --git a/Documentation/driver-api/bt8xxgpio.rst b/Documentation/driver-api/gpio/bt8xxgpio.rst index a845feb074de..d7e75f1234e7 100644 --- a/Documentation/driver-api/bt8xxgpio.rst +++ b/Documentation/driver-api/gpio/bt8xxgpio.rst @@ -2,7 +2,7 @@ A driver for a selfmade cheap BT8xx based PCI GPIO-card (bt8xxgpio) =================================================================== -For advanced documentation, see http://www.bu3sch.de/btgpio.php +For advanced documentation, see https://bues.ch/cms/unmaintained/btgpio.html A generic digital 24-port PCI GPIO card can be built out of an ordinary Brooktree bt848, bt849, bt878 or bt879 based analog TV tuner card. The diff --git a/Documentation/driver-api/gpio/driver.rst b/Documentation/driver-api/gpio/driver.rst index 921c71a3d683..871922529332 100644 --- a/Documentation/driver-api/gpio/driver.rst +++ b/Documentation/driver-api/gpio/driver.rst @@ -5,7 +5,7 @@ GPIO Driver Interface This document serves as a guide for writers of GPIO chip drivers. Each GPIO controller driver needs to include the following header, which defines -the structures used to define a GPIO driver: +the structures used to define a GPIO driver:: #include <linux/gpio/driver.h> @@ -69,9 +69,9 @@ driver code: The code implementing a gpio_chip should support multiple instances of the controller, preferably using the driver model. That code will configure each -gpio_chip and issue ``gpiochip_add[_data]()`` or ``devm_gpiochip_add_data()``. -Removing a GPIO controller should be rare; use ``[devm_]gpiochip_remove()`` -when it is unavoidable. +gpio_chip and issue gpiochip_add(), gpiochip_add_data(), or +devm_gpiochip_add_data(). Removing a GPIO controller should be rare; use +gpiochip_remove() when it is unavoidable. Often a gpio_chip is part of an instance-specific structure with states not exposed by the GPIO interfaces, such as addressing, power management, and more. @@ -259,7 +259,7 @@ most often cascaded off a parent interrupt controller, and in some special cases the GPIO logic is melded with a SoC's primary interrupt controller. The IRQ portions of the GPIO block are implemented using an irq_chip, using -the header <linux/irq.h>. So basically such a driver is utilizing two sub- +the header <linux/irq.h>. So this combined driver is utilizing two sub- systems simultaneously: gpio and irq. It is legal for any IRQ consumer to request an IRQ from any irqchip even if it @@ -391,14 +391,115 @@ Infrastructure helpers for GPIO irqchips ---------------------------------------- To help out in handling the set-up and management of GPIO irqchips and the -associated irqdomain and resource allocation callbacks, the gpiolib has -some helpers that can be enabled by selecting the GPIOLIB_IRQCHIP Kconfig -symbol: - -- gpiochip_irqchip_add(): adds a chained cascaded irqchip to a gpiochip. It - will pass the struct gpio_chip* for the chip to all IRQ callbacks, so the - callbacks need to embed the gpio_chip in its state container and obtain a - pointer to the container using container_of(). +associated irqdomain and resource allocation callbacks. These are activated +by selecting the Kconfig symbol GPIOLIB_IRQCHIP. If the symbol +IRQ_DOMAIN_HIERARCHY is also selected, hierarchical helpers will also be +provided. A big portion of overhead code will be managed by gpiolib, +under the assumption that your interrupts are 1-to-1-mapped to the +GPIO line index: + +.. csv-table:: + :header: GPIO line offset, Hardware IRQ + + 0,0 + 1,1 + 2,2 + ...,... + ngpio-1, ngpio-1 + + +If some GPIO lines do not have corresponding IRQs, the bitmask valid_mask +and the flag need_valid_mask in gpio_irq_chip can be used to mask off some +lines as invalid for associating with IRQs. + +The preferred way to set up the helpers is to fill in the +struct gpio_irq_chip inside struct gpio_chip before adding the gpio_chip. +If you do this, the additional irq_chip will be set up by gpiolib at the +same time as setting up the rest of the GPIO functionality. The following +is a typical example of a cascaded interrupt handler using gpio_irq_chip:: + +.. code-block:: c + + /* Typical state container with dynamic irqchip */ + struct my_gpio { + struct gpio_chip gc; + struct irq_chip irq; + }; + + int irq; /* from platform etc */ + struct my_gpio *g; + struct gpio_irq_chip *girq; + + /* Set up the irqchip dynamically */ + g->irq.name = "my_gpio_irq"; + g->irq.irq_ack = my_gpio_ack_irq; + g->irq.irq_mask = my_gpio_mask_irq; + g->irq.irq_unmask = my_gpio_unmask_irq; + g->irq.irq_set_type = my_gpio_set_irq_type; + + /* Get a pointer to the gpio_irq_chip */ + girq = &g->gc.irq; + girq->chip = &g->irq; + girq->parent_handler = ftgpio_gpio_irq_handler; + girq->num_parents = 1; + girq->parents = devm_kcalloc(dev, 1, sizeof(*girq->parents), + GFP_KERNEL); + if (!girq->parents) + return -ENOMEM; + girq->default_type = IRQ_TYPE_NONE; + girq->handler = handle_bad_irq; + girq->parents[0] = irq; + + return devm_gpiochip_add_data(dev, &g->gc, g); + +The helper support using hierarchical interrupt controllers as well. +In this case the typical set-up will look like this:: + +.. code-block:: c + + /* Typical state container with dynamic irqchip */ + struct my_gpio { + struct gpio_chip gc; + struct irq_chip irq; + struct fwnode_handle *fwnode; + }; + + int irq; /* from platform etc */ + struct my_gpio *g; + struct gpio_irq_chip *girq; + + /* Set up the irqchip dynamically */ + g->irq.name = "my_gpio_irq"; + g->irq.irq_ack = my_gpio_ack_irq; + g->irq.irq_mask = my_gpio_mask_irq; + g->irq.irq_unmask = my_gpio_unmask_irq; + g->irq.irq_set_type = my_gpio_set_irq_type; + + /* Get a pointer to the gpio_irq_chip */ + girq = &g->gc.irq; + girq->chip = &g->irq; + girq->default_type = IRQ_TYPE_NONE; + girq->handler = handle_bad_irq; + girq->fwnode = g->fwnode; + girq->parent_domain = parent; + girq->child_to_parent_hwirq = my_gpio_child_to_parent_hwirq; + + return devm_gpiochip_add_data(dev, &g->gc, g); + +As you can see pretty similar, but you do not supply a parent handler for +the IRQ, instead a parent irqdomain, an fwnode for the hardware and +a funcion .child_to_parent_hwirq() that has the purpose of looking up +the parent hardware irq from a child (i.e. this gpio chip) hardware irq. +As always it is good to look at examples in the kernel tree for advice +on how to find the required pieces. + +The old way of adding irqchips to gpiochips after registration is also still +available but we try to move away from this: + +- DEPRECATED: gpiochip_irqchip_add(): adds a chained cascaded irqchip to a + gpiochip. It will pass the struct gpio_chip* for the chip to all IRQ + callbacks, so the callbacks need to embed the gpio_chip in its state + container and obtain a pointer to the container using container_of(). (See Documentation/driver-api/driver-model/design-patterns.rst) - gpiochip_irqchip_add_nested(): adds a nested cascaded irqchip to a gpiochip, @@ -406,11 +507,6 @@ symbol: cascaded irq has to be handled by a threaded interrupt handler. Apart from that it works exactly like the chained irqchip. -- gpiochip_set_chained_irqchip(): sets up a chained cascaded irq handler for a - gpio_chip from a parent IRQ and passes the struct gpio_chip* as handler - data. Notice that we pass is as the handler data, since the irqchip data is - likely used by the parent irqchip. - - gpiochip_set_nested_irqchip(): sets up a nested cascaded irq handler for a gpio_chip from a parent IRQ. As the parent IRQ has usually been explicitly requested by the driver, this does very little more than @@ -418,11 +514,11 @@ symbol: If there is a need to exclude certain GPIO lines from the IRQ domain handled by these helpers, we can set .irq.need_valid_mask of the gpiochip before -``[devm_]gpiochip_add_data()`` is called. This allocates an .irq.valid_mask with as -many bits set as there are GPIO lines in the chip, each bit representing line -0..n-1. Drivers can exclude GPIO lines by clearing bits from this mask. The mask -must be filled in before gpiochip_irqchip_add() or gpiochip_irqchip_add_nested() -is called. +devm_gpiochip_add_data() or gpiochip_add_data() is called. This allocates an +.irq.valid_mask with as many bits set as there are GPIO lines in the chip, each +bit representing line 0..n-1. Drivers can exclude GPIO lines by clearing bits +from this mask. The mask must be filled in before gpiochip_irqchip_add() or +gpiochip_irqchip_add_nested() is called. To use the helpers please keep the following in mind: diff --git a/Documentation/driver-api/gpio/drivers-on-gpio.rst b/Documentation/driver-api/gpio/drivers-on-gpio.rst index f3a189320e11..820b403d50f6 100644 --- a/Documentation/driver-api/gpio/drivers-on-gpio.rst +++ b/Documentation/driver-api/gpio/drivers-on-gpio.rst @@ -95,7 +95,7 @@ to emulate MCTRL (modem control) signals CTS/RTS by using two GPIO lines. The MTD NOR flash has add-ons for extra GPIO lines too, though the address bus is usually connected directly to the flash. -Use those instead of talking directly to the GPIOs using sysfs; they integrate -with kernel frameworks better than your userspace code could. Needless to say, -just using the appropriate kernel drivers will simplify and speed up your -embedded hacking in particular by providing ready-made components. +Use those instead of talking directly to the GPIOs from userspace; they +integrate with kernel frameworks better than your userspace code could. +Needless to say, just using the appropriate kernel drivers will simplify and +speed up your embedded hacking in particular by providing ready-made components. diff --git a/Documentation/driver-api/gpio/index.rst b/Documentation/driver-api/gpio/index.rst index c5b8467f9104..1d48fe248f05 100644 --- a/Documentation/driver-api/gpio/index.rst +++ b/Documentation/driver-api/gpio/index.rst @@ -8,11 +8,13 @@ Contents: :maxdepth: 2 intro + using-gpio driver consumer board drivers-on-gpio legacy + bt8xxgpio Core ==== diff --git a/Documentation/driver-api/gpio/using-gpio.rst b/Documentation/driver-api/gpio/using-gpio.rst new file mode 100644 index 000000000000..dda069444032 --- /dev/null +++ b/Documentation/driver-api/gpio/using-gpio.rst @@ -0,0 +1,50 @@ +========================= +Using GPIO Lines in Linux +========================= + +The Linux kernel exists to abstract and present hardware to users. GPIO lines +as such are normally not user facing abstractions. The most obvious, natural +and preferred way to use GPIO lines is to let kernel hardware drivers deal +with them. + +For examples of already existing generic drivers that will also be good +examples for any other kernel drivers you want to author, refer to +:doc:`drivers-on-gpio` + +For any kind of mass produced system you want to support, such as servers, +laptops, phones, tablets, routers, and any consumer or office or business goods +using appropriate kernel drivers is paramount. Submit your code for inclusion +in the upstream Linux kernel when you feel it is mature enough and you will get +help to refine it, see :doc:`../../process/submitting-patches`. + +In Linux GPIO lines also have a userspace ABI. + +The userspace ABI is intended for one-off deployments. Examples are prototypes, +factory lines, maker community projects, workshop specimen, production tools, +industrial automation, PLC-type use cases, door controllers, in short a piece +of specialized equipment that is not produced by the numbers, requiring +operators to have a deep knowledge of the equipment and knows about the +software-hardware interface to be set up. They should not have a natural fit +to any existing kernel subsystem and not be a good fit for an operating system, +because of not being reusable or abstract enough, or involving a lot of non +computer hardware related policy. + +Applications that have a good reason to use the industrial I/O (IIO) subsystem +from userspace will likely be a good fit for using GPIO lines from userspace as +well. + +Do not under any circumstances abuse the GPIO userspace ABI to cut corners in +any product development projects. If you use it for prototyping, then do not +productify the prototype: rewrite it using proper kernel drivers. Do not under +any circumstances deploy any uniform products using GPIO from userspace. + +The userspace ABI is a character device for each GPIO hardware unit (GPIO chip). +These devices will appear on the system as ``/dev/gpiochip0`` thru +``/dev/gpiochipN``. Examples of how to directly use the userspace ABI can be +found in the kernel tree ``tools/gpio`` subdirectory. + +For structured and managed applications, we recommend that you make use of the +libgpiod_ library. This provides helper abstractions, command line utlities +and arbitration for multiple simultaneous consumers on the same GPIO chip. + +.. _libgpiod: https://git.kernel.org/pub/scm/libs/libgpiod/libgpiod.git/ diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst index d12a80f386a6..0ebe205efd0c 100644 --- a/Documentation/driver-api/index.rst +++ b/Documentation/driver-api/index.rst @@ -26,6 +26,7 @@ available subsections can be seen below. device_link component message-based + infiniband sound frame-buffer regulator @@ -39,6 +40,7 @@ available subsections can be seen below. ipmb i3c/index interconnect + devfreq hsi edac scsi @@ -65,16 +67,16 @@ available subsections can be seen below. dmaengine/index slimbus soundwire/index + thermal/index fpga/index acpi/index backlight/lp855x-driver.rst - bt8xxgpio connector console dcdbas - dell_rbu edid eisa + ipmb isa isapnp generic-counter @@ -91,7 +93,6 @@ available subsections can be seen below. pwm rfkill serial/index - sgi-ioc4 sm501 smsc_ece1099 switchtec diff --git a/Documentation/driver-api/infiniband.rst b/Documentation/driver-api/infiniband.rst new file mode 100644 index 000000000000..1a3116f32ff0 --- /dev/null +++ b/Documentation/driver-api/infiniband.rst @@ -0,0 +1,127 @@ +=========================================== +InfiniBand and Remote DMA (RDMA) Interfaces +=========================================== + +Introduction and Overview +========================= + +TBD + +InfiniBand core interfaces +========================== + +.. kernel-doc:: drivers/infiniband/core/iwpm_util.h + :internal: + +.. kernel-doc:: drivers/infiniband/core/cq.c + :export: + +.. kernel-doc:: drivers/infiniband/core/cm.c + :export: + +.. kernel-doc:: drivers/infiniband/core/rw.c + :export: + +.. kernel-doc:: drivers/infiniband/core/device.c + :export: + +.. kernel-doc:: drivers/infiniband/core/verbs.c + :export: + +.. kernel-doc:: drivers/infiniband/core/packer.c + :export: + +.. kernel-doc:: drivers/infiniband/core/sa_query.c + :export: + +.. kernel-doc:: drivers/infiniband/core/ud_header.c + :export: + +.. kernel-doc:: drivers/infiniband/core/fmr_pool.c + :export: + +.. kernel-doc:: drivers/infiniband/core/umem.c + :export: + +.. kernel-doc:: drivers/infiniband/core/umem_odp.c + :export: + +RDMA Verbs transport library +============================ + +.. kernel-doc:: drivers/infiniband/sw/rdmavt/mr.c + :export: + +.. kernel-doc:: drivers/infiniband/sw/rdmavt/rc.c + :export: + +.. kernel-doc:: drivers/infiniband/sw/rdmavt/ah.c + :export: + +.. kernel-doc:: drivers/infiniband/sw/rdmavt/vt.c + :export: + +.. kernel-doc:: drivers/infiniband/sw/rdmavt/cq.c + :export: + +.. kernel-doc:: drivers/infiniband/sw/rdmavt/qp.c + :export: + +.. kernel-doc:: drivers/infiniband/sw/rdmavt/mcast.c + :export: + +Upper Layer Protocols +===================== + +iSCSI Extensions for RDMA (iSER) +-------------------------------- + +.. kernel-doc:: drivers/infiniband/ulp/iser/iscsi_iser.h + :internal: + +.. kernel-doc:: drivers/infiniband/ulp/iser/iscsi_iser.c + :functions: iscsi_iser_pdu_alloc iser_initialize_task_headers \ + iscsi_iser_task_init iscsi_iser_mtask_xmit iscsi_iser_task_xmit \ + iscsi_iser_cleanup_task iscsi_iser_check_protection \ + iscsi_iser_conn_create iscsi_iser_conn_bind \ + iscsi_iser_conn_start iscsi_iser_conn_stop \ + iscsi_iser_session_destroy iscsi_iser_session_create \ + iscsi_iser_set_param iscsi_iser_ep_connect iscsi_iser_ep_poll \ + iscsi_iser_ep_disconnect + +.. kernel-doc:: drivers/infiniband/ulp/iser/iser_initiator.c + :internal: + +.. kernel-doc:: drivers/infiniband/ulp/iser/iser_verbs.c + :internal: + +Omni-Path (OPA) Virtual NIC support +----------------------------------- + +.. kernel-doc:: drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h + :internal: + +.. kernel-doc:: drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h + :internal: + +.. kernel-doc:: drivers/infiniband/ulp/opa_vnic/opa_vnic_vema_iface.c + :internal: + +.. kernel-doc:: drivers/infiniband/ulp/opa_vnic/opa_vnic_vema.c + :internal: + +InfiniBand SCSI RDMA protocol target support +-------------------------------------------- + +.. kernel-doc:: drivers/infiniband/ulp/srpt/ib_srpt.h + :internal: + +.. kernel-doc:: drivers/infiniband/ulp/srpt/ib_srpt.c + :internal: + +iSCSI Extensions for RDMA (iSER) target support +----------------------------------------------- + +.. kernel-doc:: drivers/infiniband/ulp/isert/ib_isert.c + :internal: + diff --git a/Documentation/driver-api/infrastructure.rst b/Documentation/driver-api/infrastructure.rst index 6172f3cc3d0b..06d98c4526df 100644 --- a/Documentation/driver-api/infrastructure.rst +++ b/Documentation/driver-api/infrastructure.rst @@ -49,9 +49,6 @@ Device Drivers Base Device Drivers DMA Management ----------------------------- -.. kernel-doc:: kernel/dma/coherent.c - :export: - .. kernel-doc:: kernel/dma/mapping.c :export: diff --git a/Documentation/driver-api/interconnect.rst b/Documentation/driver-api/interconnect.rst index c3e004893796..5ed4f57a6bac 100644 --- a/Documentation/driver-api/interconnect.rst +++ b/Documentation/driver-api/interconnect.rst @@ -1,7 +1,7 @@ .. SPDX-License-Identifier: GPL-2.0 ===================================== -GENERIC SYSTEM INTERCONNECT SUBSYSTEM +Generic System Interconnect Subsystem ===================================== Introduction @@ -91,3 +91,25 @@ Interconnect consumers are the clients which use the interconnect APIs to get paths between endpoints and set their bandwidth/latency/QoS requirements for these interconnect paths. These interfaces are not currently documented. + +Interconnect debugfs interfaces +------------------------------- + +Like several other subsystems interconnect will create some files for debugging +and introspection. Files in debugfs are not considered ABI so application +software shouldn't rely on format details change between kernel versions. + +``/sys/kernel/debug/interconnect/interconnect_summary``: + +Show all interconnect nodes in the system with their aggregated bandwidth +request. Indented under each node show bandwidth requests from each device. + +``/sys/kernel/debug/interconnect/interconnect_graph``: + +Show the interconnect graph in the graphviz dot format. It shows all +interconnect nodes and links in the system and groups together nodes from the +same provider as subgraphs. The format is human-readable and can also be piped +through dot to generate diagrams in many graphical formats:: + + $ cat /sys/kernel/debug/interconnect/interconnect_graph | \ + dot -Tsvg > interconnect_graph.svg diff --git a/Documentation/driver-api/ipmb.rst b/Documentation/driver-api/ipmb.rst index 7e2265144157..209c49e05116 100644 --- a/Documentation/driver-api/ipmb.rst +++ b/Documentation/driver-api/ipmb.rst @@ -71,9 +71,13 @@ b) Example for device tree:: ipmb@10 { compatible = "ipmb-dev"; reg = <0x10>; + i2c-protocol; }; }; +If xmit of data to be done using raw i2c block vs smbus +then "i2c-protocol" needs to be defined as above. + 2) Manually from Linux:: modprobe ipmb-dev-int @@ -83,7 +87,7 @@ Instantiate the device ---------------------- After loading the driver, you can instantiate the device as -described in 'Documentation/i2c/instantiating-devices'. +described in 'Documentation/i2c/instantiating-devices.rst'. If you have multiple BMCs, each connected to your Satellite MC via a different I2C bus, you can instantiate a device for each of those BMCs. diff --git a/Documentation/driver-api/libata.rst b/Documentation/driver-api/libata.rst index 70e180e6b93d..207f0d24de69 100644 --- a/Documentation/driver-api/libata.rst +++ b/Documentation/driver-api/libata.rst @@ -250,23 +250,23 @@ High-level taskfile hooks :: - void (*qc_prep) (struct ata_queued_cmd *qc); + enum ata_completion_errors (*qc_prep) (struct ata_queued_cmd *qc); int (*qc_issue) (struct ata_queued_cmd *qc); -Higher-level hooks, these two hooks can potentially supercede several of +Higher-level hooks, these two hooks can potentially supersede several of the above taskfile/DMA engine hooks. ``->qc_prep`` is called after the buffers have been DMA-mapped, and is typically used to populate the -hardware's DMA scatter-gather table. Most drivers use the standard -:c:func:`ata_qc_prep` helper function, but more advanced drivers roll their -own. +hardware's DMA scatter-gather table. Some drivers use the standard +:c:func:`ata_bmdma_qc_prep` and :c:func:`ata_bmdma_dumb_qc_prep` helper +functions, but more advanced drivers roll their own. ``->qc_issue`` is used to make a command active, once the hardware and S/G tables have been prepared. IDE BMDMA drivers use the helper function -:c:func:`ata_qc_issue_prot` for taskfile protocol-based dispatch. More +:c:func:`ata_sff_qc_issue` for taskfile protocol-based dispatch. More advanced drivers implement their own ``->qc_issue``. -:c:func:`ata_qc_issue_prot` calls ``->tf_load()``, ``->bmdma_setup()``, and +:c:func:`ata_sff_qc_issue` calls ``->sff_tf_load()``, ``->bmdma_setup()``, and ``->bmdma_start()`` as necessary to initiate a transfer. Exception and probe handling (EH) diff --git a/Documentation/driver-api/mtd/spi-nor.rst b/Documentation/driver-api/mtd/spi-nor.rst index f5333e3bf486..1f0437676762 100644 --- a/Documentation/driver-api/mtd/spi-nor.rst +++ b/Documentation/driver-api/mtd/spi-nor.rst @@ -59,7 +59,7 @@ Part III - How can drivers use the framework? The main API is spi_nor_scan(). Before you call the hook, a driver should initialize the necessary fields for spi_nor{}. Please see -drivers/mtd/spi-nor/spi-nor.c for detail. Please also refer to fsl-quadspi.c +drivers/mtd/spi-nor/spi-nor.c for detail. Please also refer to spi-fsl-qspi.c when you want to write a new driver for a SPI NOR controller. Another API is spi_nor_restore(), this is used to restore the status of SPI flash chip such as addressing mode. Call it whenever detach the driver from diff --git a/Documentation/driver-api/nvmem.rst b/Documentation/driver-api/nvmem.rst index d9d958d5c824..287e86819640 100644 --- a/Documentation/driver-api/nvmem.rst +++ b/Documentation/driver-api/nvmem.rst @@ -129,6 +129,8 @@ To facilitate such consumers NVMEM framework provides below apis:: struct nvmem_device *nvmem_device_get(struct device *dev, const char *name); struct nvmem_device *devm_nvmem_device_get(struct device *dev, const char *name); + struct nvmem_device *nvmem_device_find(void *data, + int (*match)(struct device *dev, const void *data)); void nvmem_device_put(struct nvmem_device *nvmem); int nvmem_device_read(struct nvmem_device *nvmem, unsigned int offset, size_t bytes, void *buf); diff --git a/Documentation/driver-api/pinctl.rst b/Documentation/driver-api/pinctl.rst index 2bb1bc484278..3d2deaf48841 100644 --- a/Documentation/driver-api/pinctl.rst +++ b/Documentation/driver-api/pinctl.rst @@ -638,8 +638,8 @@ group of pins would work something like this:: } static int foo_get_group_pins(struct pinctrl_dev *pctldev, unsigned selector, - unsigned ** const pins, - unsigned * const num_pins) + const unsigned ** pins, + unsigned * num_pins) { *pins = (unsigned *) foo_groups[selector].pins; *num_pins = foo_groups[selector].num_pins; @@ -705,7 +705,7 @@ group of pins would work something like this:: { u8 regbit = (1 << selector + group); - writeb((readb(MUX)|regbit), MUX) + writeb((readb(MUX)|regbit), MUX); return 0; } diff --git a/Documentation/driver-api/pti_intel_mid.rst b/Documentation/driver-api/pti_intel_mid.rst index 20f1cff42d5f..bacc2a4ee89f 100644 --- a/Documentation/driver-api/pti_intel_mid.rst +++ b/Documentation/driver-api/pti_intel_mid.rst @@ -49,7 +49,9 @@ but is not just blindly executing as 'root'. Keep in mind the use of ioctl(,TIOCSETD,) is not specific to the n_tracerouter and n_tracesink line discpline drivers but is a generic operation for a program to use a line discpline driver -on a tty port other than the default n_tty:: +on a tty port other than the default n_tty: + +.. code-block:: c /////////// To hook up n_tracerouter and n_tracesink ///////// diff --git a/Documentation/driver-api/serial/n_gsm.rst b/Documentation/driver-api/serial/n_gsm.rst index f3ad9fd26408..286e7ff4d2d9 100644 --- a/Documentation/driver-api/serial/n_gsm.rst +++ b/Documentation/driver-api/serial/n_gsm.rst @@ -18,18 +18,22 @@ How to use it 2. switch the serial line to using the n_gsm line discipline by using TIOCSETD ioctl, 3. configure the mux using GSMIOC_GETCONF / GSMIOC_SETCONF ioctl, +4. obtain base gsmtty number for the used serial port, Major parts of the initialization program : (a good starting point is util-linux-ng/sys-utils/ldattach.c):: + #include <stdio.h> + #include <stdint.h> #include <linux/gsmmux.h> - #define N_GSM0710 21 /* GSM 0710 Mux */ + #include <linux/tty.h> #define DEFAULT_SPEED B115200 #define SERIAL_PORT /dev/ttyS0 int ldisc = N_GSM0710; struct gsm_config c; struct termios configuration; + uint32_t first; /* open the serial port connected to the modem */ fd = open(SERIAL_PORT, O_RDWR | O_NOCTTY | O_NDELAY); @@ -58,21 +62,14 @@ Major parts of the initialization program : c.mtu = 127; /* set the new configuration */ ioctl(fd, GSMIOC_SETCONF, &c); + /* get first gsmtty device node */ + ioctl(fd, GSMIOC_GETFIRST, &first); + printf("first muxed line: /dev/gsmtty%i\n", first); /* and wait for ever to keep the line discipline enabled */ daemon(0,0); pause(); -4. create the devices corresponding to the "virtual" serial ports (take care, - each modem has its configuration and some DLC have dedicated functions, - for example GPS), starting with minor 1 (DLC0 is reserved for the management - of the mux):: - - MAJOR=`cat /proc/devices |grep gsmtty | awk '{print $1}` - for i in `seq 1 4`; do - mknod /dev/ttygsm$i c $MAJOR $i - done - 5. use these devices as plain serial ports. for example, it's possible: diff --git a/Documentation/driver-api/sgi-ioc4.rst b/Documentation/driver-api/sgi-ioc4.rst deleted file mode 100644 index 72709222d3c0..000000000000 --- a/Documentation/driver-api/sgi-ioc4.rst +++ /dev/null @@ -1,49 +0,0 @@ -==================================== -SGI IOC4 PCI (multi function) device -==================================== - -The SGI IOC4 PCI device is a bit of a strange beast, so some notes on -it are in order. - -First, even though the IOC4 performs multiple functions, such as an -IDE controller, a serial controller, a PS/2 keyboard/mouse controller, -and an external interrupt mechanism, it's not implemented as a -multifunction device. The consequence of this from a software -standpoint is that all these functions share a single IRQ, and -they can't all register to own the same PCI device ID. To make -matters a bit worse, some of the register blocks (and even registers -themselves) present in IOC4 are mixed-purpose between these several -functions, meaning that there's no clear "owning" device driver. - -The solution is to organize the IOC4 driver into several independent -drivers, "ioc4", "sgiioc4", and "ioc4_serial". Note that there is no -PS/2 controller driver as this functionality has never been wired up -on a shipping IO card. - -ioc4 -==== -This is the core (or shim) driver for IOC4. It is responsible for -initializing the basic functionality of the chip, and allocating -the PCI resources that are shared between the IOC4 functions. - -This driver also provides registration functions that the other -IOC4 drivers can call to make their presence known. Each driver -needs to provide a probe and remove function, which are invoked -by the core driver at appropriate times. The interface of these -IOC4 function probe and remove operations isn't precisely the same -as PCI device probe and remove operations, but is logically the -same operation. - -sgiioc4 -======= -This is the IDE driver for IOC4. Its name isn't very descriptive -simply for historical reasons (it used to be the only IOC4 driver -component). There's not much to say about it other than it hooks -up to the ioc4 driver via the appropriate registration, probe, and -remove functions. - -ioc4_serial -=========== -This is the serial driver for IOC4. There's not much to say about it -other than it hooks up to the ioc4 driver via the appropriate registration, -probe, and remove functions. diff --git a/Documentation/driver-api/soundwire/index.rst b/Documentation/driver-api/soundwire/index.rst index 6db026028f27..234911a0db99 100644 --- a/Documentation/driver-api/soundwire/index.rst +++ b/Documentation/driver-api/soundwire/index.rst @@ -10,7 +10,7 @@ SoundWire Documentation error_handling locking -.. only:: subproject +.. only:: subproject and html Indices ======= diff --git a/Documentation/driver-api/thermal/cpu-cooling-api.rst b/Documentation/driver-api/thermal/cpu-cooling-api.rst new file mode 100644 index 000000000000..645d914c45a6 --- /dev/null +++ b/Documentation/driver-api/thermal/cpu-cooling-api.rst @@ -0,0 +1,107 @@ +======================= +CPU cooling APIs How To +======================= + +Written by Amit Daniel Kachhap <amit.kachhap@linaro.org> + +Updated: 6 Jan 2015 + +Copyright (c) 2012 Samsung Electronics Co., Ltd(http://www.samsung.com) + +0. Introduction +=============== + +The generic cpu cooling(freq clipping) provides registration/unregistration APIs +to the caller. The binding of the cooling devices to the trip point is left for +the user. The registration APIs returns the cooling device pointer. + +1. cpu cooling APIs +=================== + +1.1 cpufreq registration/unregistration APIs +-------------------------------------------- + + :: + + struct thermal_cooling_device + *cpufreq_cooling_register(struct cpumask *clip_cpus) + + This interface function registers the cpufreq cooling device with the name + "thermal-cpufreq-%x". This api can support multiple instances of cpufreq + cooling devices. + + clip_cpus: + cpumask of cpus where the frequency constraints will happen. + + :: + + struct thermal_cooling_device + *of_cpufreq_cooling_register(struct cpufreq_policy *policy) + + This interface function registers the cpufreq cooling device with + the name "thermal-cpufreq-%x" linking it with a device tree node, in + order to bind it via the thermal DT code. This api can support multiple + instances of cpufreq cooling devices. + + policy: + CPUFreq policy. + + + :: + + void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev) + + This interface function unregisters the "thermal-cpufreq-%x" cooling device. + + cdev: Cooling device pointer which has to be unregistered. + +2. Power models +=============== + +The power API registration functions provide a simple power model for +CPUs. The current power is calculated as dynamic power (static power isn't +supported currently). This power model requires that the operating-points of +the CPUs are registered using the kernel's opp library and the +`cpufreq_frequency_table` is assigned to the `struct device` of the +cpu. If you are using CONFIG_CPUFREQ_DT then the +`cpufreq_frequency_table` should already be assigned to the cpu +device. + +The dynamic power consumption of a processor depends on many factors. +For a given processor implementation the primary factors are: + +- The time the processor spends running, consuming dynamic power, as + compared to the time in idle states where dynamic consumption is + negligible. Herein we refer to this as 'utilisation'. +- The voltage and frequency levels as a result of DVFS. The DVFS + level is a dominant factor governing power consumption. +- In running time the 'execution' behaviour (instruction types, memory + access patterns and so forth) causes, in most cases, a second order + variation. In pathological cases this variation can be significant, + but typically it is of a much lesser impact than the factors above. + +A high level dynamic power consumption model may then be represented as:: + + Pdyn = f(run) * Voltage^2 * Frequency * Utilisation + +f(run) here represents the described execution behaviour and its +result has a units of Watts/Hz/Volt^2 (this often expressed in +mW/MHz/uVolt^2) + +The detailed behaviour for f(run) could be modelled on-line. However, +in practice, such an on-line model has dependencies on a number of +implementation specific processor support and characterisation +factors. Therefore, in initial implementation that contribution is +represented as a constant coefficient. This is a simplification +consistent with the relative contribution to overall power variation. + +In this simplified representation our model becomes:: + + Pdyn = Capacitance * Voltage^2 * Frequency * Utilisation + +Where `capacitance` is a constant that represents an indicative +running time dynamic power coefficient in fundamental units of +mW/MHz/uVolt^2. Typical values for mobile CPUs might lie in range +from 100 to 500. For reference, the approximate values for the SoC in +ARM's Juno Development Platform are 530 for the Cortex-A57 cluster and +140 for the Cortex-A53 cluster. diff --git a/Documentation/driver-api/thermal/cpu-idle-cooling.rst b/Documentation/driver-api/thermal/cpu-idle-cooling.rst new file mode 100644 index 000000000000..9f0016ee4cfb --- /dev/null +++ b/Documentation/driver-api/thermal/cpu-idle-cooling.rst @@ -0,0 +1,194 @@ + +Situation: +---------- + +Under certain circumstances a SoC can reach a critical temperature +limit and is unable to stabilize the temperature around a temperature +control. When the SoC has to stabilize the temperature, the kernel can +act on a cooling device to mitigate the dissipated power. When the +critical temperature is reached, a decision must be taken to reduce +the temperature, that, in turn impacts performance. + +Another situation is when the silicon temperature continues to +increase even after the dynamic leakage is reduced to its minimum by +clock gating the component. This runaway phenomenon can continue due +to the static leakage. The only solution is to power down the +component, thus dropping the dynamic and static leakage that will +allow the component to cool down. + +Last but not least, the system can ask for a specific power budget but +because of the OPP density, we can only choose an OPP with a power +budget lower than the requested one and under-utilize the CPU, thus +losing performance. In other words, one OPP under-utilizes the CPU +with a power less than the requested power budget and the next OPP +exceeds the power budget. An intermediate OPP could have been used if +it were present. + +Solutions: +---------- + +If we can remove the static and the dynamic leakage for a specific +duration in a controlled period, the SoC temperature will +decrease. Acting on the idle state duration or the idle cycle +injection period, we can mitigate the temperature by modulating the +power budget. + +The Operating Performance Point (OPP) density has a great influence on +the control precision of cpufreq, however different vendors have a +plethora of OPP density, and some have large power gap between OPPs, +that will result in loss of performance during thermal control and +loss of power in other scenarios. + +At a specific OPP, we can assume that injecting idle cycle on all CPUs +belong to the same cluster, with a duration greater than the cluster +idle state target residency, we lead to dropping the static and the +dynamic leakage for this period (modulo the energy needed to enter +this state). So the sustainable power with idle cycles has a linear +relation with the OPP’s sustainable power and can be computed with a +coefficient similar to: + + Power(IdleCycle) = Coef x Power(OPP) + +Idle Injection: +--------------- + +The base concept of the idle injection is to force the CPU to go to an +idle state for a specified time each control cycle, it provides +another way to control CPU power and heat in addition to +cpufreq. Ideally, if all CPUs belonging to the same cluster, inject +their idle cycles synchronously, the cluster can reach its power down +state with a minimum power consumption and reduce the static leakage +to almost zero. However, these idle cycles injection will add extra +latencies as the CPUs will have to wakeup from a deep sleep state. + +We use a fixed duration of idle injection that gives an acceptable +performance penalty and a fixed latency. Mitigation can be increased +or decreased by modulating the duty cycle of the idle injection. + +:: + + ^ + | + | + |------- ------- + |_______|_______________________|_______|___________ + + <------> + idle <----------------------> + running + + <-----------------------------> + duty cycle 25% + + +The implementation of the cooling device bases the number of states on +the duty cycle percentage. When no mitigation is happening the cooling +device state is zero, meaning the duty cycle is 0%. + +When the mitigation begins, depending on the governor's policy, a +starting state is selected. With a fixed idle duration and the duty +cycle (aka the cooling device state), the running duration can be +computed. + +The governor will change the cooling device state thus the duty cycle +and this variation will modulate the cooling effect. + +:: + + ^ + | + | + |------- ------- + |_______|_______________|_______|___________ + + <------> + idle <--------------> + running + + <-----------------------------> + duty cycle 33% + + + ^ + | + | + |------- ------- + |_______|_______|_______|___________ + + <------> + idle <------> + running + + <-------------> + duty cycle 50% + +The idle injection duration value must comply with the constraints: + +- It is less than or equal to the latency we tolerate when the + mitigation begins. It is platform dependent and will depend on the + user experience, reactivity vs performance trade off we want. This + value should be specified. + +- It is greater than the idle state’s target residency we want to go + for thermal mitigation, otherwise we end up consuming more energy. + +Power considerations +-------------------- + +When we reach the thermal trip point, we have to sustain a specified +power for a specific temperature but at this time we consume: + + Power = Capacitance x Voltage^2 x Frequency x Utilisation + +... which is more than the sustainable power (or there is something +wrong in the system setup). The ‘Capacitance’ and ‘Utilisation’ are a +fixed value, ‘Voltage’ and the ‘Frequency’ are fixed artificially +because we don’t want to change the OPP. We can group the +‘Capacitance’ and the ‘Utilisation’ into a single term which is the +‘Dynamic Power Coefficient (Cdyn)’ Simplifying the above, we have: + + Pdyn = Cdyn x Voltage^2 x Frequency + +The power allocator governor will ask us somehow to reduce our power +in order to target the sustainable power defined in the device +tree. So with the idle injection mechanism, we want an average power +(Ptarget) resulting in an amount of time running at full power on a +specific OPP and idle another amount of time. That could be put in a +equation: + + P(opp)target = ((Trunning x (P(opp)running) + (Tidle x P(opp)idle)) / + (Trunning + Tidle) + + ... + + Tidle = Trunning x ((P(opp)running / P(opp)target) - 1) + +At this point if we know the running period for the CPU, that gives us +the idle injection we need. Alternatively if we have the idle +injection duration, we can compute the running duration with: + + Trunning = Tidle / ((P(opp)running / P(opp)target) - 1) + +Practically, if the running power is less than the targeted power, we +end up with a negative time value, so obviously the equation usage is +bound to a power reduction, hence a higher OPP is needed to have the +running power greater than the targeted power. + +However, in this demonstration we ignore three aspects: + + * The static leakage is not defined here, we can introduce it in the + equation but assuming it will be zero most of the time as it is + difficult to get the values from the SoC vendors + + * The idle state wake up latency (or entry + exit latency) is not + taken into account, it must be added in the equation in order to + rigorously compute the idle injection + + * The injected idle duration must be greater than the idle state + target residency, otherwise we end up consuming more energy and + potentially invert the mitigation effect + +So the final equation is: + + Trunning = (Tidle - Twakeup ) x + (((P(opp)dyn + P(opp)static ) - P(opp)target) / P(opp)target ) diff --git a/Documentation/driver-api/thermal/exynos_thermal.rst b/Documentation/driver-api/thermal/exynos_thermal.rst new file mode 100644 index 000000000000..764df4ab584d --- /dev/null +++ b/Documentation/driver-api/thermal/exynos_thermal.rst @@ -0,0 +1,90 @@ +======================== +Kernel driver exynos_tmu +======================== + +Supported chips: + +* ARM Samsung Exynos4, Exynos5 series of SoC + + Datasheet: Not publicly available + +Authors: Donggeun Kim <dg77.kim@samsung.com> +Authors: Amit Daniel <amit.daniel@samsung.com> + +TMU controller Description: +--------------------------- + +This driver allows to read temperature inside Samsung Exynos4/5 series of SoC. + +The chip only exposes the measured 8-bit temperature code value +through a register. +Temperature can be taken from the temperature code. +There are three equations converting from temperature to temperature code. + +The three equations are: + 1. Two point trimming:: + + Tc = (T - 25) * (TI2 - TI1) / (85 - 25) + TI1 + + 2. One point trimming:: + + Tc = T + TI1 - 25 + + 3. No trimming:: + + Tc = T + 50 + + Tc: + Temperature code, T: Temperature, + TI1: + Trimming info for 25 degree Celsius (stored at TRIMINFO register) + Temperature code measured at 25 degree Celsius which is unchanged + TI2: + Trimming info for 85 degree Celsius (stored at TRIMINFO register) + Temperature code measured at 85 degree Celsius which is unchanged + +TMU(Thermal Management Unit) in Exynos4/5 generates interrupt +when temperature exceeds pre-defined levels. +The maximum number of configurable threshold is five. +The threshold levels are defined as follows:: + + Level_0: current temperature > trigger_level_0 + threshold + Level_1: current temperature > trigger_level_1 + threshold + Level_2: current temperature > trigger_level_2 + threshold + Level_3: current temperature > trigger_level_3 + threshold + +The threshold and each trigger_level are set +through the corresponding registers. + +When an interrupt occurs, this driver notify kernel thermal framework +with the function exynos_report_trigger. +Although an interrupt condition for level_0 can be set, +it can be used to synchronize the cooling action. + +TMU driver description: +----------------------- + +The exynos thermal driver is structured as:: + + Kernel Core thermal framework + (thermal_core.c, step_wise.c, cpufreq_cooling.c) + ^ + | + | + TMU configuration data -----> TMU Driver <----> Exynos Core thermal wrapper + (exynos_tmu_data.c) (exynos_tmu.c) (exynos_thermal_common.c) + (exynos_tmu_data.h) (exynos_tmu.h) (exynos_thermal_common.h) + +a) TMU configuration data: + This consist of TMU register offsets/bitfields + described through structure exynos_tmu_registers. Also several + other platform data (struct exynos_tmu_platform_data) members + are used to configure the TMU. +b) TMU driver: + This component initialises the TMU controller and sets different + thresholds. It invokes core thermal implementation with the call + exynos_report_trigger. +c) Exynos Core thermal wrapper: + This provides 3 wrapper function to use the + Kernel core thermal framework. They are exynos_unregister_thermal, + exynos_register_thermal and exynos_report_trigger. diff --git a/Documentation/driver-api/thermal/exynos_thermal_emulation.rst b/Documentation/driver-api/thermal/exynos_thermal_emulation.rst new file mode 100644 index 000000000000..c21d10838bc5 --- /dev/null +++ b/Documentation/driver-api/thermal/exynos_thermal_emulation.rst @@ -0,0 +1,61 @@ +===================== +Exynos Emulation Mode +===================== + +Copyright (C) 2012 Samsung Electronics + +Written by Jonghwa Lee <jonghwa3.lee@samsung.com> + +Description +----------- + +Exynos 4x12 (4212, 4412) and 5 series provide emulation mode for thermal +management unit. Thermal emulation mode supports software debug for +TMU's operation. User can set temperature manually with software code +and TMU will read current temperature from user value not from sensor's +value. + +Enabling CONFIG_THERMAL_EMULATION option will make this support +available. When it's enabled, sysfs node will be created as +/sys/devices/virtual/thermal/thermal_zone'zone id'/emul_temp. + +The sysfs node, 'emul_node', will contain value 0 for the initial state. +When you input any temperature you want to update to sysfs node, it +automatically enable emulation mode and current temperature will be +changed into it. + +(Exynos also supports user changeable delay time which would be used to +delay of changing temperature. However, this node only uses same delay +of real sensing time, 938us.) + +Exynos emulation mode requires synchronous of value changing and +enabling. It means when you want to update the any value of delay or +next temperature, then you have to enable emulation mode at the same +time. (Or you have to keep the mode enabling.) If you don't, it fails to +change the value to updated one and just use last succeessful value +repeatedly. That's why this node gives users the right to change +termerpature only. Just one interface makes it more simply to use. + +Disabling emulation mode only requires writing value 0 to sysfs node. + +:: + + + TEMP 120 | + | + 100 | + | + 80 | + | +----------- + 60 | | | + | +-------------| | + 40 | | | | + | | | | + 20 | | | +---------- + | | | | | + 0 |______________|_____________|__________|__________|_________ + A A A A TIME + |<----->| |<----->| |<----->| | + | 938us | | | | | | + emulation : 0 50 | 70 | 20 | 0 + current temp: sensor 50 70 20 sensor diff --git a/Documentation/driver-api/thermal/index.rst b/Documentation/driver-api/thermal/index.rst new file mode 100644 index 000000000000..5ba61d19c6ae --- /dev/null +++ b/Documentation/driver-api/thermal/index.rst @@ -0,0 +1,18 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======= +Thermal +======= + +.. toctree:: + :maxdepth: 1 + + cpu-cooling-api + sysfs-api + power_allocator + + exynos_thermal + exynos_thermal_emulation + intel_powerclamp + nouveau_thermal + x86_pkg_temperature_thermal diff --git a/Documentation/driver-api/thermal/intel_powerclamp.rst b/Documentation/driver-api/thermal/intel_powerclamp.rst new file mode 100644 index 000000000000..3f6dfb0b3ea6 --- /dev/null +++ b/Documentation/driver-api/thermal/intel_powerclamp.rst @@ -0,0 +1,320 @@ +======================= +Intel Powerclamp Driver +======================= + +By: + - Arjan van de Ven <arjan@linux.intel.com> + - Jacob Pan <jacob.jun.pan@linux.intel.com> + +.. Contents: + + (*) Introduction + - Goals and Objectives + + (*) Theory of Operation + - Idle Injection + - Calibration + + (*) Performance Analysis + - Effectiveness and Limitations + - Power vs Performance + - Scalability + - Calibration + - Comparison with Alternative Techniques + + (*) Usage and Interfaces + - Generic Thermal Layer (sysfs) + - Kernel APIs (TBD) + +INTRODUCTION +============ + +Consider the situation where a system’s power consumption must be +reduced at runtime, due to power budget, thermal constraint, or noise +level, and where active cooling is not preferred. Software managed +passive power reduction must be performed to prevent the hardware +actions that are designed for catastrophic scenarios. + +Currently, P-states, T-states (clock modulation), and CPU offlining +are used for CPU throttling. + +On Intel CPUs, C-states provide effective power reduction, but so far +they’re only used opportunistically, based on workload. With the +development of intel_powerclamp driver, the method of synchronizing +idle injection across all online CPU threads was introduced. The goal +is to achieve forced and controllable C-state residency. + +Test/Analysis has been made in the areas of power, performance, +scalability, and user experience. In many cases, clear advantage is +shown over taking the CPU offline or modulating the CPU clock. + + +THEORY OF OPERATION +=================== + +Idle Injection +-------------- + +On modern Intel processors (Nehalem or later), package level C-state +residency is available in MSRs, thus also available to the kernel. + +These MSRs are:: + + #define MSR_PKG_C2_RESIDENCY 0x60D + #define MSR_PKG_C3_RESIDENCY 0x3F8 + #define MSR_PKG_C6_RESIDENCY 0x3F9 + #define MSR_PKG_C7_RESIDENCY 0x3FA + +If the kernel can also inject idle time to the system, then a +closed-loop control system can be established that manages package +level C-state. The intel_powerclamp driver is conceived as such a +control system, where the target set point is a user-selected idle +ratio (based on power reduction), and the error is the difference +between the actual package level C-state residency ratio and the target idle +ratio. + +Injection is controlled by high priority kernel threads, spawned for +each online CPU. + +These kernel threads, with SCHED_FIFO class, are created to perform +clamping actions of controlled duty ratio and duration. Each per-CPU +thread synchronizes its idle time and duration, based on the rounding +of jiffies, so accumulated errors can be prevented to avoid a jittery +effect. Threads are also bound to the CPU such that they cannot be +migrated, unless the CPU is taken offline. In this case, threads +belong to the offlined CPUs will be terminated immediately. + +Running as SCHED_FIFO and relatively high priority, also allows such +scheme to work for both preemptable and non-preemptable kernels. +Alignment of idle time around jiffies ensures scalability for HZ +values. This effect can be better visualized using a Perf timechart. +The following diagram shows the behavior of kernel thread +kidle_inject/cpu. During idle injection, it runs monitor/mwait idle +for a given "duration", then relinquishes the CPU to other tasks, +until the next time interval. + +The NOHZ schedule tick is disabled during idle time, but interrupts +are not masked. Tests show that the extra wakeups from scheduler tick +have a dramatic impact on the effectiveness of the powerclamp driver +on large scale systems (Westmere system with 80 processors). + +:: + + CPU0 + ____________ ____________ + kidle_inject/0 | sleep | mwait | sleep | + _________| |________| |_______ + duration + CPU1 + ____________ ____________ + kidle_inject/1 | sleep | mwait | sleep | + _________| |________| |_______ + ^ + | + | + roundup(jiffies, interval) + +Only one CPU is allowed to collect statistics and update global +control parameters. This CPU is referred to as the controlling CPU in +this document. The controlling CPU is elected at runtime, with a +policy that favors BSP, taking into account the possibility of a CPU +hot-plug. + +In terms of dynamics of the idle control system, package level idle +time is considered largely as a non-causal system where its behavior +cannot be based on the past or current input. Therefore, the +intel_powerclamp driver attempts to enforce the desired idle time +instantly as given input (target idle ratio). After injection, +powerclamp monitors the actual idle for a given time window and adjust +the next injection accordingly to avoid over/under correction. + +When used in a causal control system, such as a temperature control, +it is up to the user of this driver to implement algorithms where +past samples and outputs are included in the feedback. For example, a +PID-based thermal controller can use the powerclamp driver to +maintain a desired target temperature, based on integral and +derivative gains of the past samples. + + + +Calibration +----------- +During scalability testing, it is observed that synchronized actions +among CPUs become challenging as the number of cores grows. This is +also true for the ability of a system to enter package level C-states. + +To make sure the intel_powerclamp driver scales well, online +calibration is implemented. The goals for doing such a calibration +are: + +a) determine the effective range of idle injection ratio +b) determine the amount of compensation needed at each target ratio + +Compensation to each target ratio consists of two parts: + + a) steady state error compensation + This is to offset the error occurring when the system can + enter idle without extra wakeups (such as external interrupts). + + b) dynamic error compensation + When an excessive amount of wakeups occurs during idle, an + additional idle ratio can be added to quiet interrupts, by + slowing down CPU activities. + +A debugfs file is provided for the user to examine compensation +progress and results, such as on a Westmere system:: + + [jacob@nex01 ~]$ cat + /sys/kernel/debug/intel_powerclamp/powerclamp_calib + controlling cpu: 0 + pct confidence steady dynamic (compensation) + 0 0 0 0 + 1 1 0 0 + 2 1 1 0 + 3 3 1 0 + 4 3 1 0 + 5 3 1 0 + 6 3 1 0 + 7 3 1 0 + 8 3 1 0 + ... + 30 3 2 0 + 31 3 2 0 + 32 3 1 0 + 33 3 2 0 + 34 3 1 0 + 35 3 2 0 + 36 3 1 0 + 37 3 2 0 + 38 3 1 0 + 39 3 2 0 + 40 3 3 0 + 41 3 1 0 + 42 3 2 0 + 43 3 1 0 + 44 3 1 0 + 45 3 2 0 + 46 3 3 0 + 47 3 0 0 + 48 3 2 0 + 49 3 3 0 + +Calibration occurs during runtime. No offline method is available. +Steady state compensation is used only when confidence levels of all +adjacent ratios have reached satisfactory level. A confidence level +is accumulated based on clean data collected at runtime. Data +collected during a period without extra interrupts is considered +clean. + +To compensate for excessive amounts of wakeup during idle, additional +idle time is injected when such a condition is detected. Currently, +we have a simple algorithm to double the injection ratio. A possible +enhancement might be to throttle the offending IRQ, such as delaying +EOI for level triggered interrupts. But it is a challenge to be +non-intrusive to the scheduler or the IRQ core code. + + +CPU Online/Offline +------------------ +Per-CPU kernel threads are started/stopped upon receiving +notifications of CPU hotplug activities. The intel_powerclamp driver +keeps track of clamping kernel threads, even after they are migrated +to other CPUs, after a CPU offline event. + + +Performance Analysis +==================== +This section describes the general performance data collected on +multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P). + +Effectiveness and Limitations +----------------------------- +The maximum range that idle injection is allowed is capped at 50 +percent. As mentioned earlier, since interrupts are allowed during +forced idle time, excessive interrupts could result in less +effectiveness. The extreme case would be doing a ping -f to generated +flooded network interrupts without much CPU acknowledgement. In this +case, little can be done from the idle injection threads. In most +normal cases, such as scp a large file, applications can be throttled +by the powerclamp driver, since slowing down the CPU also slows down +network protocol processing, which in turn reduces interrupts. + +When control parameters change at runtime by the controlling CPU, it +may take an additional period for the rest of the CPUs to catch up +with the changes. During this time, idle injection is out of sync, +thus not able to enter package C- states at the expected ratio. But +this effect is minor, in that in most cases change to the target +ratio is updated much less frequently than the idle injection +frequency. + +Scalability +----------- +Tests also show a minor, but measurable, difference between the 4P/8P +Ivy Bridge system and the 80P Westmere server under 50% idle ratio. +More compensation is needed on Westmere for the same amount of +target idle ratio. The compensation also increases as the idle ratio +gets larger. The above reason constitutes the need for the +calibration code. + +On the IVB 8P system, compared to an offline CPU, powerclamp can +achieve up to 40% better performance per watt. (measured by a spin +counter summed over per CPU counting threads spawned for all running +CPUs). + +Usage and Interfaces +==================== +The powerclamp driver is registered to the generic thermal layer as a +cooling device. Currently, it’s not bound to any thermal zones:: + + jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . * + cur_state:0 + max_state:50 + type:intel_powerclamp + +cur_state allows user to set the desired idle percentage. Writing 0 to +cur_state will stop idle injection. Writing a value between 1 and +max_state will start the idle injection. Reading cur_state returns the +actual and current idle percentage. This may not be the same value +set by the user in that current idle percentage depends on workload +and includes natural idle. When idle injection is disabled, reading +cur_state returns value -1 instead of 0 which is to avoid confusing +100% busy state with the disabled state. + +Example usage: +- To inject 25% idle time:: + + $ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state + +If the system is not busy and has more than 25% idle time already, +then the powerclamp driver will not start idle injection. Using Top +will not show idle injection kernel threads. + +If the system is busy (spin test below) and has less than 25% natural +idle time, powerclamp kernel threads will do idle injection. Forced +idle time is accounted as normal idle in that common code path is +taken as the idle task. + +In this example, 24.1% idle is shown. This helps the system admin or +user determine the cause of slowdown, when a powerclamp driver is in action:: + + + Tasks: 197 total, 1 running, 196 sleeping, 0 stopped, 0 zombie + Cpu(s): 71.2%us, 4.7%sy, 0.0%ni, 24.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st + Mem: 3943228k total, 1689632k used, 2253596k free, 74960k buffers + Swap: 4087804k total, 0k used, 4087804k free, 945336k cached + + PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND + 3352 jacob 20 0 262m 644 428 S 286 0.0 0:17.16 spin + 3341 root -51 0 0 0 0 D 25 0.0 0:01.62 kidle_inject/0 + 3344 root -51 0 0 0 0 D 25 0.0 0:01.60 kidle_inject/3 + 3342 root -51 0 0 0 0 D 25 0.0 0:01.61 kidle_inject/1 + 3343 root -51 0 0 0 0 D 25 0.0 0:01.60 kidle_inject/2 + 2935 jacob 20 0 696m 125m 35m S 5 3.3 0:31.11 firefox + 1546 root 20 0 158m 20m 6640 S 3 0.5 0:26.97 Xorg + 2100 jacob 20 0 1223m 88m 30m S 3 2.3 0:23.68 compiz + +Tests have shown that by using the powerclamp driver as a cooling +device, a PID based userspace thermal controller can manage to +control CPU temperature effectively, when no other thermal influence +is added. For example, a UltraBook user can compile the kernel under +certain temperature (below most active trip points). diff --git a/Documentation/driver-api/thermal/nouveau_thermal.rst b/Documentation/driver-api/thermal/nouveau_thermal.rst new file mode 100644 index 000000000000..37255fd6735d --- /dev/null +++ b/Documentation/driver-api/thermal/nouveau_thermal.rst @@ -0,0 +1,96 @@ +===================== +Kernel driver nouveau +===================== + +Supported chips: + +* NV43+ + +Authors: Martin Peres (mupuf) <martin.peres@free.fr> + +Description +----------- + +This driver allows to read the GPU core temperature, drive the GPU fan and +set temperature alarms. + +Currently, due to the absence of in-kernel API to access HWMON drivers, Nouveau +cannot access any of the i2c external monitoring chips it may find. If you +have one of those, temperature and/or fan management through Nouveau's HWMON +interface is likely not to work. This document may then not cover your situation +entirely. + +Temperature management +---------------------- + +Temperature is exposed under as a read-only HWMON attribute temp1_input. + +In order to protect the GPU from overheating, Nouveau supports 4 configurable +temperature thresholds: + + * Fan_boost: + Fan speed is set to 100% when reaching this temperature; + * Downclock: + The GPU will be downclocked to reduce its power dissipation; + * Critical: + The GPU is put on hold to further lower power dissipation; + * Shutdown: + Shut the computer down to protect your GPU. + +WARNING: + Some of these thresholds may not be used by Nouveau depending + on your chipset. + +The default value for these thresholds comes from the GPU's vbios. These +thresholds can be configured thanks to the following HWMON attributes: + + * Fan_boost: temp1_auto_point1_temp and temp1_auto_point1_temp_hyst; + * Downclock: temp1_max and temp1_max_hyst; + * Critical: temp1_crit and temp1_crit_hyst; + * Shutdown: temp1_emergency and temp1_emergency_hyst. + +NOTE: Remember that the values are stored as milli degrees Celsius. Don't forget +to multiply! + +Fan management +-------------- + +Not all cards have a drivable fan. If you do, then the following HWMON +attributes should be available: + + * pwm1_enable: + Current fan management mode (NONE, MANUAL or AUTO); + * pwm1: + Current PWM value (power percentage); + * pwm1_min: + The minimum PWM speed allowed; + * pwm1_max: + The maximum PWM speed allowed (bypassed when hitting Fan_boost); + +You may also have the following attribute: + + * fan1_input: + Speed in RPM of your fan. + +Your fan can be driven in different modes: + + * 0: The fan is left untouched; + * 1: The fan can be driven in manual (use pwm1 to change the speed); + * 2; The fan is driven automatically depending on the temperature. + +NOTE: + Be sure to use the manual mode if you want to drive the fan speed manually + +NOTE2: + When operating in manual mode outside the vbios-defined + [PWM_min, PWM_max] range, the reported fan speed (RPM) may not be accurate + depending on your hardware. + +Bug reports +----------- + +Thermal management on Nouveau is new and may not work on all cards. If you have +inquiries, please ping mupuf on IRC (#nouveau, freenode). + +Bug reports should be filled on Freedesktop's bug tracker. Please follow +http://nouveau.freedesktop.org/wiki/Bugs diff --git a/Documentation/driver-api/thermal/power_allocator.rst b/Documentation/driver-api/thermal/power_allocator.rst new file mode 100644 index 000000000000..67b6a3297238 --- /dev/null +++ b/Documentation/driver-api/thermal/power_allocator.rst @@ -0,0 +1,271 @@ +================================= +Power allocator governor tunables +================================= + +Trip points +----------- + +The governor works optimally with the following two passive trip points: + +1. "switch on" trip point: temperature above which the governor + control loop starts operating. This is the first passive trip + point of the thermal zone. + +2. "desired temperature" trip point: it should be higher than the + "switch on" trip point. This the target temperature the governor + is controlling for. This is the last passive trip point of the + thermal zone. + +PID Controller +-------------- + +The power allocator governor implements a +Proportional-Integral-Derivative controller (PID controller) with +temperature as the control input and power as the controlled output: + + P_max = k_p * e + k_i * err_integral + k_d * diff_err + sustainable_power + +where + - e = desired_temperature - current_temperature + - err_integral is the sum of previous errors + - diff_err = e - previous_error + +It is similar to the one depicted below:: + + k_d + | + current_temp | + | v + | +----------+ +---+ + | +----->| diff_err |-->| X |------+ + | | +----------+ +---+ | + | | | tdp actor + | | k_i | | get_requested_power() + | | | | | | | + | | | | | | | ... + v | v v v v v + +---+ | +-------+ +---+ +---+ +---+ +----------+ + | S |-----+----->| sum e |----->| X |--->| S |-->| S |-->|power | + +---+ | +-------+ +---+ +---+ +---+ |allocation| + ^ | ^ +----------+ + | | | | | + | | +---+ | | | + | +------->| X |-------------------+ v v + | +---+ granted performance + desired_temperature ^ + | + | + k_po/k_pu + +Sustainable power +----------------- + +An estimate of the sustainable dissipatable power (in mW) should be +provided while registering the thermal zone. This estimates the +sustained power that can be dissipated at the desired control +temperature. This is the maximum sustained power for allocation at +the desired maximum temperature. The actual sustained power can vary +for a number of reasons. The closed loop controller will take care of +variations such as environmental conditions, and some factors related +to the speed-grade of the silicon. `sustainable_power` is therefore +simply an estimate, and may be tuned to affect the aggressiveness of +the thermal ramp. For reference, the sustainable power of a 4" phone +is typically 2000mW, while on a 10" tablet is around 4500mW (may vary +depending on screen size). + +If you are using device tree, do add it as a property of the +thermal-zone. For example:: + + thermal-zones { + soc_thermal { + polling-delay = <1000>; + polling-delay-passive = <100>; + sustainable-power = <2500>; + ... + +Instead, if the thermal zone is registered from the platform code, pass a +`thermal_zone_params` that has a `sustainable_power`. If no +`thermal_zone_params` were being passed, then something like below +will suffice:: + + static const struct thermal_zone_params tz_params = { + .sustainable_power = 3500, + }; + +and then pass `tz_params` as the 5th parameter to +`thermal_zone_device_register()` + +k_po and k_pu +------------- + +The implementation of the PID controller in the power allocator +thermal governor allows the configuration of two proportional term +constants: `k_po` and `k_pu`. `k_po` is the proportional term +constant during temperature overshoot periods (current temperature is +above "desired temperature" trip point). Conversely, `k_pu` is the +proportional term constant during temperature undershoot periods +(current temperature below "desired temperature" trip point). + +These controls are intended as the primary mechanism for configuring +the permitted thermal "ramp" of the system. For instance, a lower +`k_pu` value will provide a slower ramp, at the cost of capping +available capacity at a low temperature. On the other hand, a high +value of `k_pu` will result in the governor granting very high power +while temperature is low, and may lead to temperature overshooting. + +The default value for `k_pu` is:: + + 2 * sustainable_power / (desired_temperature - switch_on_temp) + +This means that at `switch_on_temp` the output of the controller's +proportional term will be 2 * `sustainable_power`. The default value +for `k_po` is:: + + sustainable_power / (desired_temperature - switch_on_temp) + +Focusing on the proportional and feed forward values of the PID +controller equation we have:: + + P_max = k_p * e + sustainable_power + +The proportional term is proportional to the difference between the +desired temperature and the current one. When the current temperature +is the desired one, then the proportional component is zero and +`P_max` = `sustainable_power`. That is, the system should operate in +thermal equilibrium under constant load. `sustainable_power` is only +an estimate, which is the reason for closed-loop control such as this. + +Expanding `k_pu` we get:: + + P_max = 2 * sustainable_power * (T_set - T) / (T_set - T_on) + + sustainable_power + +where: + + - T_set is the desired temperature + - T is the current temperature + - T_on is the switch on temperature + +When the current temperature is the switch_on temperature, the above +formula becomes:: + + P_max = 2 * sustainable_power * (T_set - T_on) / (T_set - T_on) + + sustainable_power = 2 * sustainable_power + sustainable_power = + 3 * sustainable_power + +Therefore, the proportional term alone linearly decreases power from +3 * `sustainable_power` to `sustainable_power` as the temperature +rises from the switch on temperature to the desired temperature. + +k_i and integral_cutoff +----------------------- + +`k_i` configures the PID loop's integral term constant. This term +allows the PID controller to compensate for long term drift and for +the quantized nature of the output control: cooling devices can't set +the exact power that the governor requests. When the temperature +error is below `integral_cutoff`, errors are accumulated in the +integral term. This term is then multiplied by `k_i` and the result +added to the output of the controller. Typically `k_i` is set low (1 +or 2) and `integral_cutoff` is 0. + +k_d +--- + +`k_d` configures the PID loop's derivative term constant. It's +recommended to leave it as the default: 0. + +Cooling device power API +======================== + +Cooling devices controlled by this governor must supply the additional +"power" API in their `cooling_device_ops`. It consists on three ops: + +1. :: + + int get_requested_power(struct thermal_cooling_device *cdev, + struct thermal_zone_device *tz, u32 *power); + + +@cdev: + The `struct thermal_cooling_device` pointer +@tz: + thermal zone in which we are currently operating +@power: + pointer in which to store the calculated power + +`get_requested_power()` calculates the power requested by the device +in milliwatts and stores it in @power . It should return 0 on +success, -E* on failure. This is currently used by the power +allocator governor to calculate how much power to give to each cooling +device. + +2. :: + + int state2power(struct thermal_cooling_device *cdev, struct + thermal_zone_device *tz, unsigned long state, + u32 *power); + +@cdev: + The `struct thermal_cooling_device` pointer +@tz: + thermal zone in which we are currently operating +@state: + A cooling device state +@power: + pointer in which to store the equivalent power + +Convert cooling device state @state into power consumption in +milliwatts and store it in @power. It should return 0 on success, -E* +on failure. This is currently used by thermal core to calculate the +maximum power that an actor can consume. + +3. :: + + int power2state(struct thermal_cooling_device *cdev, u32 power, + unsigned long *state); + +@cdev: + The `struct thermal_cooling_device` pointer +@power: + power in milliwatts +@state: + pointer in which to store the resulting state + +Calculate a cooling device state that would make the device consume at +most @power mW and store it in @state. It should return 0 on success, +-E* on failure. This is currently used by the thermal core to convert +a given power set by the power allocator governor to a state that the +cooling device can set. It is a function because this conversion may +depend on external factors that may change so this function should the +best conversion given "current circumstances". + +Cooling device weights +---------------------- + +Weights are a mechanism to bias the allocation among cooling +devices. They express the relative power efficiency of different +cooling devices. Higher weight can be used to express higher power +efficiency. Weighting is relative such that if each cooling device +has a weight of one they are considered equal. This is particularly +useful in heterogeneous systems where two cooling devices may perform +the same kind of compute, but with different efficiency. For example, +a system with two different types of processors. + +If the thermal zone is registered using +`thermal_zone_device_register()` (i.e., platform code), then weights +are passed as part of the thermal zone's `thermal_bind_parameters`. +If the platform is registered using device tree, then they are passed +as the `contribution` property of each map in the `cooling-maps` node. + +Limitations of the power allocator governor +=========================================== + +The power allocator governor's PID controller works best if there is a +periodic tick. If you have a driver that calls +`thermal_zone_device_update()` (or anything that ends up calling the +governor's `throttle()` function) repetitively, the governor response +won't be very good. Note that this is not particular to this +governor, step-wise will also misbehave if you call its throttle() +faster than the normal thermal framework tick (due to interrupts for +example) as it will overreact. diff --git a/Documentation/driver-api/thermal/sysfs-api.rst b/Documentation/driver-api/thermal/sysfs-api.rst new file mode 100644 index 000000000000..b40b1f839148 --- /dev/null +++ b/Documentation/driver-api/thermal/sysfs-api.rst @@ -0,0 +1,784 @@ +=================================== +Generic Thermal Sysfs driver How To +=================================== + +Written by Sujith Thomas <sujith.thomas@intel.com>, Zhang Rui <rui.zhang@intel.com> + +Updated: 2 January 2008 + +Copyright (c) 2008 Intel Corporation + + +0. Introduction +=============== + +The generic thermal sysfs provides a set of interfaces for thermal zone +devices (sensors) and thermal cooling devices (fan, processor...) to register +with the thermal management solution and to be a part of it. + +This how-to focuses on enabling new thermal zone and cooling devices to +participate in thermal management. +This solution is platform independent and any type of thermal zone devices +and cooling devices should be able to make use of the infrastructure. + +The main task of the thermal sysfs driver is to expose thermal zone attributes +as well as cooling device attributes to the user space. +An intelligent thermal management application can make decisions based on +inputs from thermal zone attributes (the current temperature and trip point +temperature) and throttle appropriate devices. + +- `[0-*]` denotes any positive number starting from 0 +- `[1-*]` denotes any positive number starting from 1 + +1. thermal sysfs driver interface functions +=========================================== + +1.1 thermal zone device interface +--------------------------------- + + :: + + struct thermal_zone_device + *thermal_zone_device_register(char *type, + int trips, int mask, void *devdata, + struct thermal_zone_device_ops *ops, + const struct thermal_zone_params *tzp, + int passive_delay, int polling_delay)) + + This interface function adds a new thermal zone device (sensor) to + /sys/class/thermal folder as `thermal_zone[0-*]`. It tries to bind all the + thermal cooling devices registered at the same time. + + type: + the thermal zone type. + trips: + the total number of trip points this thermal zone supports. + mask: + Bit string: If 'n'th bit is set, then trip point 'n' is writeable. + devdata: + device private data + ops: + thermal zone device call-backs. + + .bind: + bind the thermal zone device with a thermal cooling device. + .unbind: + unbind the thermal zone device with a thermal cooling device. + .get_temp: + get the current temperature of the thermal zone. + .set_trips: + set the trip points window. Whenever the current temperature + is updated, the trip points immediately below and above the + current temperature are found. + .get_mode: + get the current mode (enabled/disabled) of the thermal zone. + + - "enabled" means the kernel thermal management is + enabled. + - "disabled" will prevent kernel thermal driver action + upon trip points so that user applications can take + charge of thermal management. + .set_mode: + set the mode (enabled/disabled) of the thermal zone. + .get_trip_type: + get the type of certain trip point. + .get_trip_temp: + get the temperature above which the certain trip point + will be fired. + .set_emul_temp: + set the emulation temperature which helps in debugging + different threshold temperature points. + tzp: + thermal zone platform parameters. + passive_delay: + number of milliseconds to wait between polls when + performing passive cooling. + polling_delay: + number of milliseconds to wait between polls when checking + whether trip points have been crossed (0 for interrupt driven systems). + + :: + + void thermal_zone_device_unregister(struct thermal_zone_device *tz) + + This interface function removes the thermal zone device. + It deletes the corresponding entry from /sys/class/thermal folder and + unbinds all the thermal cooling devices it uses. + + :: + + struct thermal_zone_device + *thermal_zone_of_sensor_register(struct device *dev, int sensor_id, + void *data, + const struct thermal_zone_of_device_ops *ops) + + This interface adds a new sensor to a DT thermal zone. + This function will search the list of thermal zones described in + device tree and look for the zone that refer to the sensor device + pointed by dev->of_node as temperature providers. For the zone + pointing to the sensor node, the sensor will be added to the DT + thermal zone device. + + The parameters for this interface are: + + dev: + Device node of sensor containing valid node pointer in + dev->of_node. + sensor_id: + a sensor identifier, in case the sensor IP has more + than one sensors + data: + a private pointer (owned by the caller) that will be + passed back, when a temperature reading is needed. + ops: + `struct thermal_zone_of_device_ops *`. + + ============== ======================================= + get_temp a pointer to a function that reads the + sensor temperature. This is mandatory + callback provided by sensor driver. + set_trips a pointer to a function that sets a + temperature window. When this window is + left the driver must inform the thermal + core via thermal_zone_device_update. + get_trend a pointer to a function that reads the + sensor temperature trend. + set_emul_temp a pointer to a function that sets + sensor emulated temperature. + ============== ======================================= + + The thermal zone temperature is provided by the get_temp() function + pointer of thermal_zone_of_device_ops. When called, it will + have the private pointer @data back. + + It returns error pointer if fails otherwise valid thermal zone device + handle. Caller should check the return handle with IS_ERR() for finding + whether success or not. + + :: + + void thermal_zone_of_sensor_unregister(struct device *dev, + struct thermal_zone_device *tzd) + + This interface unregisters a sensor from a DT thermal zone which was + successfully added by interface thermal_zone_of_sensor_register(). + This function removes the sensor callbacks and private data from the + thermal zone device registered with thermal_zone_of_sensor_register() + interface. It will also silent the zone by remove the .get_temp() and + get_trend() thermal zone device callbacks. + + :: + + struct thermal_zone_device + *devm_thermal_zone_of_sensor_register(struct device *dev, + int sensor_id, + void *data, + const struct thermal_zone_of_device_ops *ops) + + This interface is resource managed version of + thermal_zone_of_sensor_register(). + + All details of thermal_zone_of_sensor_register() described in + section 1.1.3 is applicable here. + + The benefit of using this interface to register sensor is that it + is not require to explicitly call thermal_zone_of_sensor_unregister() + in error path or during driver unbinding as this is done by driver + resource manager. + + :: + + void devm_thermal_zone_of_sensor_unregister(struct device *dev, + struct thermal_zone_device *tzd) + + This interface is resource managed version of + thermal_zone_of_sensor_unregister(). + All details of thermal_zone_of_sensor_unregister() described in + section 1.1.4 is applicable here. + Normally this function will not need to be called and the resource + management code will ensure that the resource is freed. + + :: + + int thermal_zone_get_slope(struct thermal_zone_device *tz) + + This interface is used to read the slope attribute value + for the thermal zone device, which might be useful for platform + drivers for temperature calculations. + + :: + + int thermal_zone_get_offset(struct thermal_zone_device *tz) + + This interface is used to read the offset attribute value + for the thermal zone device, which might be useful for platform + drivers for temperature calculations. + +1.2 thermal cooling device interface +------------------------------------ + + + :: + + struct thermal_cooling_device + *thermal_cooling_device_register(char *name, + void *devdata, struct thermal_cooling_device_ops *) + + This interface function adds a new thermal cooling device (fan/processor/...) + to /sys/class/thermal/ folder as `cooling_device[0-*]`. It tries to bind itself + to all the thermal zone devices registered at the same time. + + name: + the cooling device name. + devdata: + device private data. + ops: + thermal cooling devices call-backs. + + .get_max_state: + get the Maximum throttle state of the cooling device. + .get_cur_state: + get the Currently requested throttle state of the + cooling device. + .set_cur_state: + set the Current throttle state of the cooling device. + + :: + + void thermal_cooling_device_unregister(struct thermal_cooling_device *cdev) + + This interface function removes the thermal cooling device. + It deletes the corresponding entry from /sys/class/thermal folder and + unbinds itself from all the thermal zone devices using it. + +1.3 interface for binding a thermal zone device with a thermal cooling device +----------------------------------------------------------------------------- + + :: + + int thermal_zone_bind_cooling_device(struct thermal_zone_device *tz, + int trip, struct thermal_cooling_device *cdev, + unsigned long upper, unsigned long lower, unsigned int weight); + + This interface function binds a thermal cooling device to a particular trip + point of a thermal zone device. + + This function is usually called in the thermal zone device .bind callback. + + tz: + the thermal zone device + cdev: + thermal cooling device + trip: + indicates which trip point in this thermal zone the cooling device + is associated with. + upper: + the Maximum cooling state for this trip point. + THERMAL_NO_LIMIT means no upper limit, + and the cooling device can be in max_state. + lower: + the Minimum cooling state can be used for this trip point. + THERMAL_NO_LIMIT means no lower limit, + and the cooling device can be in cooling state 0. + weight: + the influence of this cooling device in this thermal + zone. See 1.4.1 below for more information. + + :: + + int thermal_zone_unbind_cooling_device(struct thermal_zone_device *tz, + int trip, struct thermal_cooling_device *cdev); + + This interface function unbinds a thermal cooling device from a particular + trip point of a thermal zone device. This function is usually called in + the thermal zone device .unbind callback. + + tz: + the thermal zone device + cdev: + thermal cooling device + trip: + indicates which trip point in this thermal zone the cooling device + is associated with. + +1.4 Thermal Zone Parameters +--------------------------- + + :: + + struct thermal_bind_params + + This structure defines the following parameters that are used to bind + a zone with a cooling device for a particular trip point. + + .cdev: + The cooling device pointer + .weight: + The 'influence' of a particular cooling device on this + zone. This is relative to the rest of the cooling + devices. For example, if all cooling devices have a + weight of 1, then they all contribute the same. You can + use percentages if you want, but it's not mandatory. A + weight of 0 means that this cooling device doesn't + contribute to the cooling of this zone unless all cooling + devices have a weight of 0. If all weights are 0, then + they all contribute the same. + .trip_mask: + This is a bit mask that gives the binding relation between + this thermal zone and cdev, for a particular trip point. + If nth bit is set, then the cdev and thermal zone are bound + for trip point n. + .binding_limits: + This is an array of cooling state limits. Must have + exactly 2 * thermal_zone.number_of_trip_points. It is an + array consisting of tuples <lower-state upper-state> of + state limits. Each trip will be associated with one state + limit tuple when binding. A NULL pointer means + <THERMAL_NO_LIMITS THERMAL_NO_LIMITS> on all trips. + These limits are used when binding a cdev to a trip point. + .match: + This call back returns success(0) if the 'tz and cdev' need to + be bound, as per platform data. + + :: + + struct thermal_zone_params + + This structure defines the platform level parameters for a thermal zone. + This data, for each thermal zone should come from the platform layer. + This is an optional feature where some platforms can choose not to + provide this data. + + .governor_name: + Name of the thermal governor used for this zone + .no_hwmon: + a boolean to indicate if the thermal to hwmon sysfs interface + is required. when no_hwmon == false, a hwmon sysfs interface + will be created. when no_hwmon == true, nothing will be done. + In case the thermal_zone_params is NULL, the hwmon interface + will be created (for backward compatibility). + .num_tbps: + Number of thermal_bind_params entries for this zone + .tbp: + thermal_bind_params entries + +2. sysfs attributes structure +============================= + +== ================ +RO read only value +WO write only value +RW read/write value +== ================ + +Thermal sysfs attributes will be represented under /sys/class/thermal. +Hwmon sysfs I/F extension is also available under /sys/class/hwmon +if hwmon is compiled in or built as a module. + +Thermal zone device sys I/F, created once it's registered:: + + /sys/class/thermal/thermal_zone[0-*]: + |---type: Type of the thermal zone + |---temp: Current temperature + |---mode: Working mode of the thermal zone + |---policy: Thermal governor used for this zone + |---available_policies: Available thermal governors for this zone + |---trip_point_[0-*]_temp: Trip point temperature + |---trip_point_[0-*]_type: Trip point type + |---trip_point_[0-*]_hyst: Hysteresis value for this trip point + |---emul_temp: Emulated temperature set node + |---sustainable_power: Sustainable dissipatable power + |---k_po: Proportional term during temperature overshoot + |---k_pu: Proportional term during temperature undershoot + |---k_i: PID's integral term in the power allocator gov + |---k_d: PID's derivative term in the power allocator + |---integral_cutoff: Offset above which errors are accumulated + |---slope: Slope constant applied as linear extrapolation + |---offset: Offset constant applied as linear extrapolation + +Thermal cooling device sys I/F, created once it's registered:: + + /sys/class/thermal/cooling_device[0-*]: + |---type: Type of the cooling device(processor/fan/...) + |---max_state: Maximum cooling state of the cooling device + |---cur_state: Current cooling state of the cooling device + |---stats: Directory containing cooling device's statistics + |---stats/reset: Writing any value resets the statistics + |---stats/time_in_state_ms: Time (msec) spent in various cooling states + |---stats/total_trans: Total number of times cooling state is changed + |---stats/trans_table: Cooing state transition table + + +Then next two dynamic attributes are created/removed in pairs. They represent +the relationship between a thermal zone and its associated cooling device. +They are created/removed for each successful execution of +thermal_zone_bind_cooling_device/thermal_zone_unbind_cooling_device. + +:: + + /sys/class/thermal/thermal_zone[0-*]: + |---cdev[0-*]: [0-*]th cooling device in current thermal zone + |---cdev[0-*]_trip_point: Trip point that cdev[0-*] is associated with + |---cdev[0-*]_weight: Influence of the cooling device in + this thermal zone + +Besides the thermal zone device sysfs I/F and cooling device sysfs I/F, +the generic thermal driver also creates a hwmon sysfs I/F for each _type_ +of thermal zone device. E.g. the generic thermal driver registers one hwmon +class device and build the associated hwmon sysfs I/F for all the registered +ACPI thermal zones. + +:: + + /sys/class/hwmon/hwmon[0-*]: + |---name: The type of the thermal zone devices + |---temp[1-*]_input: The current temperature of thermal zone [1-*] + |---temp[1-*]_critical: The critical trip point of thermal zone [1-*] + +Please read Documentation/hwmon/sysfs-interface.rst for additional information. + +Thermal zone attributes +----------------------- + +type + Strings which represent the thermal zone type. + This is given by thermal zone driver as part of registration. + E.g: "acpitz" indicates it's an ACPI thermal device. + In order to keep it consistent with hwmon sys attribute; this should + be a short, lowercase string, not containing spaces nor dashes. + RO, Required + +temp + Current temperature as reported by thermal zone (sensor). + Unit: millidegree Celsius + RO, Required + +mode + One of the predefined values in [enabled, disabled]. + This file gives information about the algorithm that is currently + managing the thermal zone. It can be either default kernel based + algorithm or user space application. + + enabled + enable Kernel Thermal management. + disabled + Preventing kernel thermal zone driver actions upon + trip points so that user application can take full + charge of the thermal management. + + RW, Optional + +policy + One of the various thermal governors used for a particular zone. + + RW, Required + +available_policies + Available thermal governors which can be used for a particular zone. + + RO, Required + +`trip_point_[0-*]_temp` + The temperature above which trip point will be fired. + + Unit: millidegree Celsius + + RO, Optional + +`trip_point_[0-*]_type` + Strings which indicate the type of the trip point. + + E.g. it can be one of critical, hot, passive, `active[0-*]` for ACPI + thermal zone. + + RO, Optional + +`trip_point_[0-*]_hyst` + The hysteresis value for a trip point, represented as an integer + Unit: Celsius + RW, Optional + +`cdev[0-*]` + Sysfs link to the thermal cooling device node where the sys I/F + for cooling device throttling control represents. + + RO, Optional + +`cdev[0-*]_trip_point` + The trip point in this thermal zone which `cdev[0-*]` is associated + with; -1 means the cooling device is not associated with any trip + point. + + RO, Optional + +`cdev[0-*]_weight` + The influence of `cdev[0-*]` in this thermal zone. This value + is relative to the rest of cooling devices in the thermal + zone. For example, if a cooling device has a weight double + than that of other, it's twice as effective in cooling the + thermal zone. + + RW, Optional + +passive + Attribute is only present for zones in which the passive cooling + policy is not supported by native thermal driver. Default is zero + and can be set to a temperature (in millidegrees) to enable a + passive trip point for the zone. Activation is done by polling with + an interval of 1 second. + + Unit: millidegrees Celsius + + Valid values: 0 (disabled) or greater than 1000 + + RW, Optional + +emul_temp + Interface to set the emulated temperature method in thermal zone + (sensor). After setting this temperature, the thermal zone may pass + this temperature to platform emulation function if registered or + cache it locally. This is useful in debugging different temperature + threshold and its associated cooling action. This is write only node + and writing 0 on this node should disable emulation. + Unit: millidegree Celsius + + WO, Optional + + WARNING: + Be careful while enabling this option on production systems, + because userland can easily disable the thermal policy by simply + flooding this sysfs node with low temperature values. + +sustainable_power + An estimate of the sustained power that can be dissipated by + the thermal zone. Used by the power allocator governor. For + more information see Documentation/driver-api/thermal/power_allocator.rst + + Unit: milliwatts + + RW, Optional + +k_po + The proportional term of the power allocator governor's PID + controller during temperature overshoot. Temperature overshoot + is when the current temperature is above the "desired + temperature" trip point. For more information see + Documentation/driver-api/thermal/power_allocator.rst + + RW, Optional + +k_pu + The proportional term of the power allocator governor's PID + controller during temperature undershoot. Temperature undershoot + is when the current temperature is below the "desired + temperature" trip point. For more information see + Documentation/driver-api/thermal/power_allocator.rst + + RW, Optional + +k_i + The integral term of the power allocator governor's PID + controller. This term allows the PID controller to compensate + for long term drift. For more information see + Documentation/driver-api/thermal/power_allocator.rst + + RW, Optional + +k_d + The derivative term of the power allocator governor's PID + controller. For more information see + Documentation/driver-api/thermal/power_allocator.rst + + RW, Optional + +integral_cutoff + Temperature offset from the desired temperature trip point + above which the integral term of the power allocator + governor's PID controller starts accumulating errors. For + example, if integral_cutoff is 0, then the integral term only + accumulates error when temperature is above the desired + temperature trip point. For more information see + Documentation/driver-api/thermal/power_allocator.rst + + Unit: millidegree Celsius + + RW, Optional + +slope + The slope constant used in a linear extrapolation model + to determine a hotspot temperature based off the sensor's + raw readings. It is up to the device driver to determine + the usage of these values. + + RW, Optional + +offset + The offset constant used in a linear extrapolation model + to determine a hotspot temperature based off the sensor's + raw readings. It is up to the device driver to determine + the usage of these values. + + RW, Optional + +Cooling device attributes +------------------------- + +type + String which represents the type of device, e.g: + + - for generic ACPI: should be "Fan", "Processor" or "LCD" + - for memory controller device on intel_menlow platform: + should be "Memory controller". + + RO, Required + +max_state + The maximum permissible cooling state of this cooling device. + + RO, Required + +cur_state + The current cooling state of this cooling device. + The value can any integer numbers between 0 and max_state: + + - cur_state == 0 means no cooling + - cur_state == max_state means the maximum cooling. + + RW, Required + +stats/reset + Writing any value resets the cooling device's statistics. + WO, Required + +stats/time_in_state_ms: + The amount of time spent by the cooling device in various cooling + states. The output will have "<state> <time>" pair in each line, which + will mean this cooling device spent <time> msec of time at <state>. + Output will have one line for each of the supported states. usertime + units here is 10mS (similar to other time exported in /proc). + RO, Required + + +stats/total_trans: + A single positive value showing the total number of times the state of a + cooling device is changed. + + RO, Required + +stats/trans_table: + This gives fine grained information about all the cooling state + transitions. The cat output here is a two dimensional matrix, where an + entry <i,j> (row i, column j) represents the number of transitions from + State_i to State_j. If the transition table is bigger than PAGE_SIZE, + reading this will return an -EFBIG error. + RO, Required + +3. A simple implementation +========================== + +ACPI thermal zone may support multiple trip points like critical, hot, +passive, active. If an ACPI thermal zone supports critical, passive, +active[0] and active[1] at the same time, it may register itself as a +thermal_zone_device (thermal_zone1) with 4 trip points in all. +It has one processor and one fan, which are both registered as +thermal_cooling_device. Both are considered to have the same +effectiveness in cooling the thermal zone. + +If the processor is listed in _PSL method, and the fan is listed in _AL0 +method, the sys I/F structure will be built like this:: + + /sys/class/thermal: + |thermal_zone1: + |---type: acpitz + |---temp: 37000 + |---mode: enabled + |---policy: step_wise + |---available_policies: step_wise fair_share + |---trip_point_0_temp: 100000 + |---trip_point_0_type: critical + |---trip_point_1_temp: 80000 + |---trip_point_1_type: passive + |---trip_point_2_temp: 70000 + |---trip_point_2_type: active0 + |---trip_point_3_temp: 60000 + |---trip_point_3_type: active1 + |---cdev0: --->/sys/class/thermal/cooling_device0 + |---cdev0_trip_point: 1 /* cdev0 can be used for passive */ + |---cdev0_weight: 1024 + |---cdev1: --->/sys/class/thermal/cooling_device3 + |---cdev1_trip_point: 2 /* cdev1 can be used for active[0]*/ + |---cdev1_weight: 1024 + + |cooling_device0: + |---type: Processor + |---max_state: 8 + |---cur_state: 0 + + |cooling_device3: + |---type: Fan + |---max_state: 2 + |---cur_state: 0 + + /sys/class/hwmon: + |hwmon0: + |---name: acpitz + |---temp1_input: 37000 + |---temp1_crit: 100000 + +4. Export Symbol APIs +===================== + +4.1. get_tz_trend +----------------- + +This function returns the trend of a thermal zone, i.e the rate of change +of temperature of the thermal zone. Ideally, the thermal sensor drivers +are supposed to implement the callback. If they don't, the thermal +framework calculated the trend by comparing the previous and the current +temperature values. + +4.2. get_thermal_instance +------------------------- + +This function returns the thermal_instance corresponding to a given +{thermal_zone, cooling_device, trip_point} combination. Returns NULL +if such an instance does not exist. + +4.3. thermal_notify_framework +----------------------------- + +This function handles the trip events from sensor drivers. It starts +throttling the cooling devices according to the policy configured. +For CRITICAL and HOT trip points, this notifies the respective drivers, +and does actual throttling for other trip points i.e ACTIVE and PASSIVE. +The throttling policy is based on the configured platform data; if no +platform data is provided, this uses the step_wise throttling policy. + +4.4. thermal_cdev_update +------------------------ + +This function serves as an arbitrator to set the state of a cooling +device. It sets the cooling device to the deepest cooling state if +possible. + +5. thermal_emergency_poweroff +============================= + +On an event of critical trip temperature crossing. Thermal framework +allows the system to shutdown gracefully by calling orderly_poweroff(). +In the event of a failure of orderly_poweroff() to shut down the system +we are in danger of keeping the system alive at undesirably high +temperatures. To mitigate this high risk scenario we program a work +queue to fire after a pre-determined number of seconds to start +an emergency shutdown of the device using the kernel_power_off() +function. In case kernel_power_off() fails then finally +emergency_restart() is called in the worst case. + +The delay should be carefully profiled so as to give adequate time for +orderly_poweroff(). In case of failure of an orderly_poweroff() the +emergency poweroff kicks in after the delay has elapsed and shuts down +the system. + +If set to 0 emergency poweroff will not be supported. So a carefully +profiled non-zero positive value is a must for emergerncy poweroff to be +triggered. diff --git a/Documentation/driver-api/thermal/x86_pkg_temperature_thermal.rst b/Documentation/driver-api/thermal/x86_pkg_temperature_thermal.rst new file mode 100644 index 000000000000..2ac42ccd236f --- /dev/null +++ b/Documentation/driver-api/thermal/x86_pkg_temperature_thermal.rst @@ -0,0 +1,55 @@ +=================================== +Kernel driver: x86_pkg_temp_thermal +=================================== + +Supported chips: + +* x86: with package level thermal management + +(Verify using: CPUID.06H:EAX[bit 6] =1) + +Authors: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> + +Reference +--------- + +Intel® 64 and IA-32 Architectures Software Developer’s Manual (Jan, 2013): +Chapter 14.6: PACKAGE LEVEL THERMAL MANAGEMENT + +Description +----------- + +This driver register CPU digital temperature package level sensor as a thermal +zone with maximum two user mode configurable trip points. Number of trip points +depends on the capability of the package. Once the trip point is violated, +user mode can receive notification via thermal notification mechanism and can +take any action to control temperature. + + +Threshold management +-------------------- +Each package will register as a thermal zone under /sys/class/thermal. + +Example:: + + /sys/class/thermal/thermal_zone1 + +This contains two trip points: + +- trip_point_0_temp +- trip_point_1_temp + +User can set any temperature between 0 to TJ-Max temperature. Temperature units +are in milli-degree Celsius. Refer to "Documentation/driver-api/thermal/sysfs-api.rst" for +thermal sys-fs details. + +Any value other than 0 in these trip points, can trigger thermal notifications. +Setting 0, stops sending thermal notifications. + +Thermal notifications: +To get kobject-uevent notifications, set the thermal zone +policy to "user_space". + +For example:: + + echo -n "user_space" > policy diff --git a/Documentation/driver-api/uio-howto.rst b/Documentation/driver-api/uio-howto.rst index 8fecfa11d4ff..84091cd25dc4 100644 --- a/Documentation/driver-api/uio-howto.rst +++ b/Documentation/driver-api/uio-howto.rst @@ -408,6 +408,13 @@ handler code. You also do not need to know anything about the chip's internal registers to create the kernel part of the driver. All you need to know is the irq number of the pin the chip is connected to. +When used in a device-tree enabled system, the driver needs to be +probed with the ``"of_id"`` module parameter set to the ``"compatible"`` +string of the node the driver is supposed to handle. By default, the +node's name (without the unit address) is exposed as name for the +UIO device in userspace. To set a custom name, a property named +``"linux,uio-name"`` may be specified in the DT node. + Using uio_dmem_genirq for platform devices ------------------------------------------ |